
I have strings of data: number, space(s), then a word that can contain letters, numbers and special characters as well as spaces. I need to isolate the first number only, and then also the words only so I can re-render the data into a table.

1 foo
2   ba_r
3  foo bar
4   fo-o

EDIT: I was attempting this with "^[0-9]+[" "]"; however, that doesn't work.

  • Can you show us the regex that you are using so far? StackOverflow is not a community that serves you finished code, but a community that helps you debug and improve your own. Commented Jun 12, 2013 at 15:00

2 Answers

You can use this regex to capture each line:

/^(\d+)\s+(.*)$/m

With the m modifier, ^ and $ anchor at the start and end of each line. The regex captures one or more digits, then matches one or more whitespace characters, then captures everything up to the end of the line.
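As a quick check of what each group captures, here is a minimal sketch that applies the same pattern to a single sample line from the question. The variable names ($line, $m, $num, $word) are only illustrative.

```php
<?php
// Sketch: apply the pattern to one line and inspect the capture groups.
$line = '3  foo bar'; // sample line taken from the question

$num = null;
$word = null;
if (preg_match('/^(\d+)\s+(.*)$/m', $line, $m)) {
    // $m[0] is the whole match, $m[1] the leading digits,
    // $m[2] everything after the whitespace.
    $num  = $m[1];
    $word = $m[2];
}
```

Here $num ends up as the string '3' and $word as 'foo bar': the run of spaces between them is consumed by \s+ and is not part of either group.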

Then, with preg_match_all(), you can get the data you want:

preg_match_all( '/^(\d+)\s+(.*)$/m', $input, $matches, PREG_SET_ORDER);

Then, you can just parse out the data from the $matches array, like this:

$data = array();
foreach( $matches as $match) {
    list( , $num, $word) = $match;
    $data[] = array( $num, $word);
    // Or: $data[$num] = $word;
}

A print_r( $data); will print:

Array
(
    [0] => Array
        (
            [0] => 1
            [1] => foo
        )

    [1] => Array
        (
            [0] => 2
            [1] => ba_r
        )

    [2] => Array
        (
            [0] => 3
            [1] => foo bar
        )

    [3] => Array
        (
            [0] => 4
            [1] => fo-o
        )

)
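Since the question's goal is to re-render the data as a table, the $data array can be fed straight into a small rendering step. The sketch below assumes an HTML table is wanted and PHP 7.1 or later; renderTable() is a hypothetical helper, not part of the answer.

```php
<?php
// Hypothetical helper (an assumption, not from the answer):
// turn [number, word] pairs into a simple HTML table.
function renderTable(array $data): string
{
    $rows = '';
    foreach ($data as [$num, $word]) {
        // Escape both cells, since the words may contain special characters.
        $rows .= '<tr><td>' . htmlspecialchars($num) . '</td><td>'
               . htmlspecialchars($word) . "</td></tr>\n";
    }
    return "<table>\n" . $rows . "</table>\n";
}

// Same sample input and parsing steps as in the answer.
$input = "1 foo\n2   ba_r\n3  foo bar\n4   fo-o";
preg_match_all('/^(\d+)\s+(.*)$/m', $input, $matches, PREG_SET_ORDER);

$data = array();
foreach ($matches as $match) {
    list(, $num, $word) = $match;
    $data[] = array($num, $word);
}

$html = renderTable($data);
```

Echoing $html would emit one table row per input line, with the number in the first cell and the word in the second.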

5 Comments

@Downvoter - Any comment? I'd like to improve my answer if possible.
I did not downvote, however I may have suggestions. I do not see the point in ^, $ and the m modifier. The m modifier here is only necessary to have matches with ^ and $. However, since .* does not match newlines without the s modifier, the pattern must be matched within a single line anyway, so this is not really necessary. The only thing it does is prevent a match in lines that have non-digit characters before the leading number, and I don't know why one would want that. A simpler solution for the same thing would be .* in the beginning. Also, the loop appears unnecessary.
Max characters of the response ran out... so in short: you have code in there that appears unnecessary and unnecessarily complicated (I bet a lot of programmers do not even know by heart what m does; however, a simple .* in the beginning is clear).
@TheSurrican While you raise some interesting points, I would have to disagree. The PCRE regex modifiers are quite ubiquitous (IMO), and for your explanation, you needed to clarify that .* does not match newlines, which is something somebody can easily forget. But, anchoring the regex at the start/end of line not only distinctly and clearly defines that the match we are looking for spans one complete line, it also prevents errors where another regex could match within a line, which would be incorrect. For example: foo 1 bar 2 baz 3. Clearly this is erroneous input, and should be ignored.
I think that depends on the scenario where the regex is employed. In the context of this question, I understand that the text syntax can be relied upon, and the greediness of the asterisk quantifier takes care that the whole line is matched. Probably, in the end, it's a question of style...
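The anchoring point debated in these comments can be checked directly. This is a minimal sketch comparing the anchored pattern from the answer with an unanchored variant on the malformed line quoted above; the variable names are illustrative, and the unanchored pattern is written here only for comparison.

```php
<?php
$bad = 'foo 1 bar 2 baz 3'; // malformed line quoted in the comment above

// Anchored pattern from the answer: the line does not start with digits,
// so there is no match and preg_match() returns 0.
$anchored = preg_match('/^(\d+)\s+(.*)$/m', $bad, $a);

// Unanchored variant (for comparison only): matching starts at the first
// digit inside the line, so the malformed line is silently accepted.
$unanchored = preg_match('/(\d+)\s+(.*)/', $bad, $u);
```

With the unanchored variant, $u[1] is '1' and $u[2] is 'bar 2 baz 3'. Whether that counts as a bug or as acceptable leniency depends on how much the input format can be trusted, which is the disagreement in the thread.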
$str = <<<body
1 foo
2   ba_r
3  foo bar
4   fo-o
body;

preg_match_all('/(?P<numbers>\d+) +(?P<words>.+)/', $str, $matches);
print_r(array_combine($matches['numbers'],$matches['words']));

outputs

Array
(
    [1] => foo
    [2] => ba_r
    [3] => foo bar
    [4] => fo-o
)
