php: better way to split string into associative array

Question

I have a string like this:

"ALARM_ID/I4=1010001 ALARM_STATE/U4=eventcode ALARM_TEXT/A=WMR_MAP_EXPORT LOTS/A[1]=[ STEFANO ] ALARM_STATE/U1=1 WAFER/U4=1 VI_KLARF_MAP/A=/test/klarf.map KLARF_STEPID/A=StepID KLARF_DEVICEID/A=DeviceID KLARF_EQUIPMENTID/A=EquipmentID KLARF_SETUP_ID/A=SetupID RULE_ID/U4=1234 RULE_FORMULA_EXPRESSION/A=a < b && c > d RULE_FORMULA_TEXT/A=1 < 0 && 2 > 3 RULE_FORMULA_RESULT/A=FAIL TIMESTAMP/A=10-Nov-2020 09:10:11 99999999"

and my goal is to split it into an associative array:

Array
(
    [ALARM_ID/I4] => 1010001
    [ALARM_STATE/U4] => eventcode
    [ALARM_TEXT/A] => WMR_MAP_EXPORT
    [LOTS/A[1]] => [ STEFANO ]
    [ALARM_STATE/U1] => 1
    [WAFER/U4] => 1
    [VI_KLARF_MAP/A] => /test/klarf.map
    [KLARF_STEPID/A] => StepID
    [KLARF_DEVICEID/A] => DeviceID
    [KLARF_EQUIPMENTID/A] => EquipmentID
    [KLARF_SETUP_ID/A] => SetupID
    [RULE_ID/U4] => 1234
    [RULE_FORMULA_EXPRESSION/A] => a < b && c > d
    [RULE_FORMULA_TEXT/A] => 1 < 0 && 2 > 3
    [RULE_FORMULA_RESULT/A] => FAIL
    [TIMESTAMP/A] => 10-Nov-2020 09:10:11 99999999
)

The only (admittedly dirty) way I have found is this script:

<?php
$msg = "ALARM_ID/I4=1010001 ALARM_STATE/U4=eventcode ALARM_TEXT/A=WMR_MAP_EXPORT LOTS/A[1]=[ STEFANO ] ALARM_STATE/U1=1 WAFER/U4=1 VI_KLARF_MAP/A=/test/klarf.map KLARF_STEPID/A=StepID KLARF_DEVICEID/A=DeviceID KLARF_EQUIPMENTID/A=EquipmentID KLARF_SETUP_ID/A=SetupID RULE_ID/U4=1234 RULE_FORMULA_EXPRESSION/A=a < b && c > d RULE_FORMULA_TEXT/A=1 < 0 && 2 > 3 RULE_FORMULA_RESULT/A=FAIL TIMESTAMP/A=10-Nov-2020 09:10:11 99999999";
$split = explode("=", $msg);
foreach($split as $k => $s) {
    $s = explode(" ", $s);
    $keys[] = array_pop($s);
    if ($s) $values[] = implode(" ", $s);
}
/*
 * this is needed when the value of the last parameter (TIMESTAMP) contains no spaces
 */
if (count($values) + 2 == count($keys)) array_push($values, array_pop($keys));
else                                    $values[ count($values) - 1 ] .= " " . array_pop($keys);
$params = array_combine($keys, $values);
print_r($params);
?>

Do you see a better way to split it, maybe using a regular expression or a different (more elegant?) approach?

Can you change the string you're getting? A better practice would be to receive the string in a format such as JSON or XML, which would make accidental parsing mistakes far less likely. Or can you not influence how you receive the string?
– Definitely not Rafal, Commented Nov 11, 2020 at 8:23
@DefinitelynotRafal Unfortunately I cannot. The string is received from an automation host in VFEI (Virtual Factory Equipment Interface) format (that's an unchangeable standard).
– Stefano Radaelli, Commented Nov 11, 2020 at 8:31

mickmackusa · Accepted Answer · 2020-11-11 12:37:02Z

4

The important thing to do in maintaining accuracy is to ensure that "keys" are properly matched.

Key strings will never contain a space or an equals sign. Value strings may contain either. Value strings will run to the end of the string or be followed by a space then the next key (which may not have any spaces or equal signs).

The key string can be "greedily" matched before the occurrence of the first encountered =.

The value string must not be greedily matched. This ensures that the value is not over-extended into the next key-value pair.

The lookahead after the value string ensures that the potential following key is not damaged/consumed.
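
To make the greedy versus non-greedy point concrete, here is a minimal sketch that runs the pattern from this answer on a short, made-up string and compares it with a variant whose value group is greedy. The sample string and variable names are illustrative assumptions, not part of the original message.

```php
<?php
// Illustrative input only (not taken from the original message).
$sample = 'A/U4=1 B/A=two words C/U4=3';

// Non-greedy value (.+?): each value stops just before the next " key=".
preg_match_all('~([^=]+)=(.+?)(?=$| [^ =]+=)~', $sample, $lazy);

// Greedy value (.+), shown for comparison: the first value runs to the end.
preg_match_all('~([^=]+)=(.+)(?=$| [^ =]+=)~', $sample, $greedy);

var_export($lazy[2]);    // the captured values of the non-greedy pattern
var_export($greedy[2]);  // the single captured value of the greedy variant
```

With the non-greedy group the three values come out separately; the greedy variant produces one match whose value swallows the remaining pairs.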

Pattern Breakdown:

([^=]+)      #capture one or more non-equals sign (greedily) and store as capture group #1
=            #match but do not capture an equals sign
(.+?)        #capture one or more of any non-newline character (giving back when possible / non-greedy) and store as capture group #2
(?=          #start lookahead
  $          #match the end of the string
  |          #OR operator
   [^ =]+=   #match space, then one or more non-space and non-equals characters, then match equals sign
)            #end lookahead
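
The breakdown can also be kept next to the code by writing the pattern in free-spacing form. This is only a sketch: it assumes PCRE's x modifier, under which unescaped whitespace outside character classes is ignored, so the literal space in the lookahead has to be written as an escaped space. The sample string is made up for illustration.

```php
<?php
// Same pattern as in the breakdown, written with the x (extended) modifier.
// The space before the next key is escaped because x mode would otherwise
// ignore it; inside the character class [^ =] it stays literal.
$pattern = '~
    ([^=]+)        # group 1: the key, everything up to the first "="
    =              # the separator, matched but not captured
    (.+?)          # group 2: the value, as short as possible
    (?=            # lookahead: the value must be followed by
        $          #   the end of the string
      |            #   or
        \ [^ =]+=  #   a space, then the next key and its "="
    )
~x';

$sample = 'A/U4=1 B/A=two words';   // illustrative input only
preg_match_all($pattern, $sample, $out);
var_export($out[2]);
```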

Code:

$msg = "ALARM_ID/I4=1010001 ALARM_STATE/U4=eventcode ALARM_TEXT/A=WMR_MAP_EXPORT LOTS/A[1]=[ STEFANO ] ALARM_STATE/U1=1 WAFER/U4=1 VI_KLARF_MAP/A=/test/klarf.map KLARF_STEPID/A=StepID KLARF_DEVICEID/A=DeviceID KLARF_EQUIPMENTID/A=EquipmentID KLARF_SETUP_ID/A=SetupID RULE_ID/U4=1234 RULE_FORMULA_EXPRESSION/A=a < b && c > d RULE_FORMULA_TEXT/A=1 < 0 && 2 > 3 RULE_FORMULA_RESULT/A=FAIL TIMESTAMP/A=10-Nov-2020 09:10:11 99999999";

preg_match_all('~([^=]+)=(.+?)(?=$| [^ =]+=)~', $msg, $out);
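// [^=]+ also accepts the space that separates two pairs, so trim the captured keys
var_export(array_combine(array_map('trim', $out[1]), $out[2]));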

Output:

array (
  'ALARM_ID/I4' => '1010001',
  'ALARM_STATE/U4' => 'eventcode',
  'ALARM_TEXT/A' => 'WMR_MAP_EXPORT',
  'LOTS/A[1]' => '[ STEFANO ]',
  'ALARM_STATE/U1' => '1',
  'WAFER/U4' => '1',
  'VI_KLARF_MAP/A' => '/test/klarf.map',
  'KLARF_STEPID/A' => 'StepID',
  'KLARF_DEVICEID/A' => 'DeviceID',
  'KLARF_EQUIPMENTID/A' => 'EquipmentID',
  'KLARF_SETUP_ID/A' => 'SetupID',
  'RULE_ID/U4' => '1234',
  'RULE_FORMULA_EXPRESSION/A' => 'a < b && c > d',
  'RULE_FORMULA_TEXT/A' => '1 < 0 && 2 > 3',
  'RULE_FORMULA_RESULT/A' => 'FAIL',
  'TIMESTAMP/A' => '10-Nov-2020 09:10:11 99999999',
)

edited Nov 11, 2020 at 12:37

answered Nov 11, 2020 at 9:02

mickmackusa♦

3 Comments

mickmackusa Over a year ago

Can anyone explain why fourthbird's answer is gathering more votes than my correct, accurate, concise answer? At one point, they both had 1 UV, but for some unknown reason, his/her answer is pulling ahead and biasing researchers away from my answer. If there is something beyond a popularity contest, I'd like to know what is going on.

mickmackusa Over a year ago

I know I am not everyone's cup of tea, but the voting should be on the answer, not the answerer. If the UVs are from writing out the regex breakdown, then I am happy to edit my answer.

mickmackusa Over a year ago

@Stefano I would like to understand the metric by which you found TheFourthBird's answer to be superior. I used regex101 to compare the patterns and these are the results: His first pattern: ([^\s=/]+/[^\s=]+)=(.*?)(?=\h+[^\s=/]+/|$), 42-character pattern, 16 matches, 921 steps ; His second pattern: ([^\W_]+(?:_[^\W_]+)*/[^\s=]*)=(.*?)(?=\h+[^\s=/]+/|$), 54-character pattern, 16 matches, 1007 steps ; My pattern: ([^=]+)=(.+?)(?=$| [^ =]+=) , 27-character pattern , 16 matches, 809 steps So, mine is provably more efficient and more concise.

The fourth bird · Accepted Answer · 2020-11-11 09:22:18Z

3

You could leverage the presence of a / in all the keys

([^\s=/]+/[^\s=]+)=(.*?)(?=\h+[^\s=/]+/|$)

Explanation

( Capture group 1
- [^\s=/]+ Match 1+ times any char except a whitespace, = or /
- /[^\s=]+ Then match / followed by the rest of the key
) Close group 1
= Match literally
(.*?) Capture group 2, match any char except a newline, as few as possible
(?=\h+[^\s=/]+/|$) Assert a key like format containing a / (as used in group 1)
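
As a small illustration of what requiring the / buys, the sketch below runs the pattern on a made-up string whose value contains something that looks like a pair (y=z) but has no / in its name. The sample string and variable names are assumptions for illustration only.

```php
<?php
// Illustrative input only: the value of NOTE/A contains "y=z", which looks
// like a key=value pair but has no "/" in its name.
$sample = 'NOTE/A=x y=z NEXT/U4=1';

// The lookahead only ends a value when the following word contains a "/".
preg_match_all('`([^\s=/]+/[^\s=]+)=(.*?)(?=\h+[^\s=/]+/|$)`', $sample, $m);
$withSlash = array_combine($m[1], $m[2]);

var_export($withSlash);
```

Because y has no /, the lookahead does not treat it as the start of a new key, so y=z stays inside the value of NOTE/A. A lookahead that only checks for a word followed by = would end that value at x.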

Example code

$re = '`([^\s=/]+/[^\s=]+)=(.*?)(?=\h+[^\s=/]+/|$)`';
$str = 'ALARM_ID/I4=1010001 ALARM_STATE/U4=eventcode ALARM_TEXT/A=WMR_MAP_EXPORT LOTS/A[1]=[ STEFANO ] ALARM_STATE/U1=1 WAFER/U4=1 VI_KLARF_MAP/A=/test/klarf.map KLARF_STEPID/A=StepID KLARF_DEVICEID/A=DeviceID KLARF_EQUIPMENTID/A=EquipmentID KLARF_SETUP_ID/A=SetupID RULE_ID/U4=1234 RULE_FORMULA_EXPRESSION/A=a < b && c > d RULE_FORMULA_TEXT/A=1 < 0 && 2 > 3 RULE_FORMULA_RESULT/A=FAIL TIMESTAMP/A=10-Nov-2020 09:10:11 99999999';

preg_match_all($re, $str, $matches);
$result = array_combine($matches[1], $matches[2]);

print_r($result);

Output

Array
(
    [ALARM_ID/I4] => 1010001
    [ALARM_STATE/U4] => eventcode
    [ALARM_TEXT/A] => WMR_MAP_EXPORT
    [LOTS/A[1]] => [ STEFANO ]
    [ALARM_STATE/U1] => 1
    [WAFER/U4] => 1
    [VI_KLARF_MAP/A] => /test/klarf.map
    [KLARF_STEPID/A] => StepID
    [KLARF_DEVICEID/A] => DeviceID
    [KLARF_EQUIPMENTID/A] => EquipmentID
    [KLARF_SETUP_ID/A] => SetupID
    [RULE_ID/U4] => 1234
    [RULE_FORMULA_EXPRESSION/A] => a < b && c > d
    [RULE_FORMULA_TEXT/A] => 1 < 0 && 2 > 3
    [RULE_FORMULA_RESULT/A] => FAIL
    [TIMESTAMP/A] => 10-Nov-2020 09:10:11 99999999
)

If the keys should all start with word characters separated by an underscore, you can start the pattern using a repeating part [^\W_]+(?:_[^\W_]+)*

It will match word chars except an _, and then repeat matching _ followed by word chars except _ until it reaches a /

([^\W_]+(?:_[^\W_]+)*/[^\s=]*)=(.*?)(?=\h+[^\s=/]+/|$)
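
To see which key names this stricter key part accepts, here is a minimal sketch that anchors it with ^ and $ and tests a few names. The anchors and the two rejected names are assumptions added for illustration; they do not appear in the original message.

```php
<?php
// Key part of the second pattern, anchored so it must cover the whole name.
$keyShape = '~^[^\W_]+(?:_[^\W_]+)*/[^\s=]*$~';

// The first two names come from the sample message; the last two are made up
// to show what the underscore rule rejects.
$names = ['ALARM_ID/I4', 'LOTS/A[1]', '_ALARM/I4', 'ALARM__ID/I4'];

$accepted = [];
foreach ($names as $name) {
    $accepted[$name] = preg_match($keyShape, $name) === 1;
}
var_export($accepted);
```

A leading underscore or a doubled underscore before the / makes the name fail, while anything without whitespace or = is allowed after the /.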

edited Nov 11, 2020 at 9:22

answered Nov 11, 2020 at 8:53

The fourth bird

4 Comments

Jeto Over a year ago

Just out of curiosity: why use \s sometimes and \h some other time? I understand \s includes carriage returns as well (and maybe some vertical whitespaces), but since the original string doesn't appear to contain any, I'm wondering.

mickmackusa Over a year ago

There is no mention of tabs or newlines in the sample input. The \s, the \h, and the m all seem needless to me.

The fourth bird Over a year ago

@Jeto Fair question. I used \s in the negated character class [^\s=]+ so that the key cannot contain any whitespace character, including a newline, which I assume is not desired in a key. I use \h in the assertion to match horizontal whitespace chars, to make sure the value stays on the same line. I think for this example data you could use either \s or \h.

The fourth bird Over a year ago

@mickmackusa The m should not be there; it came from copy-pasting the regex101 generated code. If you only want to match a space instead of \s or \h, that is fine. I use them to match a broader range of whitespace chars.

KIKO Software · Accepted Answer · 2020-11-11 11:25:35Z

I came up with this code, using basic PHP functions. I think that a regular expression makes the code more difficult to read. Most of the time, even at the expense of having more verbose code, you are better off not using regular expressions. There might also be a performance impact.

$message = "ALARM_ID/I4=1010001 ALARM_STATE/U4=eventcode ALARM_TEXT/A=WMR_MAP_EXPORT LOTS/A[1]=[ STEFANO ] ALARM_STATE/U1=1 WAFER/U4=1 VI_KLARF_MAP/A=/test/klarf.map KLARF_STEPID/A=StepID KLARF_DEVICEID/A=DeviceID KLARF_EQUIPMENTID/A=EquipmentID KLARF_SETUP_ID/A=SetupID RULE_ID/U4=1234 RULE_FORMULA_EXPRESSION/A=a < b && c > d RULE_FORMULA_TEXT/A=1 < 0 && 2 > 3 RULE_FORMULA_RESULT/A=FAIL TIMESTAMP/A=10-Nov-2020 09:10:11 99999999";

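$parameters = [];
foreach (explode(' ', $message) as $word) {
    // A word with '=' after its first character starts a new key=value pair.
    if (strpos($word, '=')) {
        if (isset($key)) $parameters[$key] = $value;
        // The limit of 2 keeps any further '=' inside the value.
        list($key, $value) = explode('=', $word, 2);
    }
    // Any other word belongs to the current value.
    else $value .= " $word";
}
// Store the last pair, which no later key flushes.
$parameters[$key] = $value;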

echo '<pre>';
print_r($parameters);
echo '</pre>';

I chose to split on the spaces, then I look for the = characters to find the words with the keys in them.

There are, of course, other ways of doing the same, but all will involve a bit of extra work because of the strange format of the message.

This routine currently does not tolerate errors in the message string, but it can easily be expanded to tolerate various types of input errors.
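
As one possible direction for that, here is a minimal sketch of a more tolerant variant. The function name parseKeyValueWords and the chosen tolerance rules (skip words before the first key, return an empty array for empty input) are assumptions for illustration, not part of the original routine.

```php
<?php
// Hypothetical helper: the name and the tolerance rules are assumptions.
function parseKeyValueWords(string $message): array
{
    $parameters = [];
    $key = null;

    foreach (explode(' ', trim($message)) as $word) {
        $pos = strpos($word, '=');
        if ($pos !== false && $pos > 0) {
            // A word such as "NAME=..." starts a new pair; the limit of 2
            // keeps any further "=" inside the value.
            [$key, $value] = explode('=', $word, 2);
            $parameters[$key] = $value;
        } elseif ($key !== null) {
            // Any other word continues the current value.
            $parameters[$key] .= " $word";
        }
        // Words seen before the first key are skipped.
    }

    return $parameters;
}

var_export(parseKeyValueWords('stray LOTS/A[1]=[ STEFANO ] RULE/A=a = b'));
```

Like the routine above, this still treats any word of the form name=value inside a value as the start of a new pair, so it tolerates malformed input but does not resolve that ambiguity.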
