How to transform structured text files into a PHP multidimensional array

Question

I have 100 files, each containing a varying number of news articles. Each article is divided into sections marked by the following abbreviations:

HD BY WC PD SN SC PG LA CY LP TD CO IN NS RE IPC PUB AN

where [LP] and [TD] can contain any number of paragraphs.

A typical article looks like this:

HD Corporate News: Alcoa Earnings Soar; Outlook Stays Upbeat 
BY By James R. Hagerty and Matthew Day 
WC 421 words
PD 12 July 2011
SN The Wall Street Journal
SC J
PG B7
LA English
CY (Copyright (c) 2011, Dow Jones & Company, Inc.) 

LP 

Alcoa Inc.'s profit more than doubled in the second quarter, but the giant 
aluminum producer managed only to meet analysts' recently lowered forecasts.

Alcoa serves as a bellwether for U.S. corporate earnings because it is the 
first major company to report and draws demand from a wide range of 
industries.

TD 

The results marked an early test of how corporate optimism is holding up 
in the face of bleak economic news.

License this article from Dow Jones Reprint 
Service[http://www.djreprints.com/link/link.html?FACTIVA=wjco20110712000115]

CO 
almam : ALCOA Inc

IN 
i2245 : Aluminum | i22 : Primary Metals | i224 : Non-ferrous Metals | imet 
  : Metals/Mining

NS 
c15 : Performance | c151 : Earnings | c1521 : Analyst 
Comment/Recommendation | ccat : Corporate/Industrial News | c152 : 
Earnings Projections | ncat : Content Types | nfact : Factiva Filters | 
nfce : FC&E Exclusion Filter | nfcpin : FC&E Industry News Filter

RE 
usa : United States | use : Northeast U.S. | uspa : Pennsylvania | namz : 
North America

IPC 
DJCS | EWR | BSC | NND | CNS | LMJ | TPT

PUB 
Dow Jones & Company, Inc.

AN 
Document J000000020110712e77c00035

After each article, there are 4 newlines before a new article starts. I need to put these articles into an array, as follows:

$articles = array(
  [0] => array (
    [HD] => Corporate News: Alcoa earnings Soar; Outlook...
    [BY] => By James R. Hagerty...
    ...
    [AN] => Document J000000020110712e77c00035
  )
)

What have you tried?

– Jason McCreary, Commented Aug 19, 2013 at 16:23

14 revs · Accepted Answer · 2021-11-13 19:26:09Z

One approach uses explode to separate the articles and a regex to extract the fields:

$pattern = <<<'LOD'
~
# definition
(?<fieldname> (?:HD|BY|WC|PD|SN|SC|PG|LA|CY|LP|TD|CO|IN|NS|RE|IPC|PUB|AN)$ ){0}

# pattern
\G(?<key>\g<fieldname>) \s+
(?<value>
    .+ 
    (?: \R{1,2} (?!\g<fieldname>) .+ )*+
)
(?:\R{1,3}|\z)
~xm
LOD;
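// $text holds the contents of one file.
// Normalize line endings first: in multiline mode, $ only matches before "\n"
// by default, so a "\r" left after a field name would prevent any match.
$text = str_replace(array("\r\n", "\r"), "\n", $text);

// Articles are separated by four consecutive newlines.
$subjects = explode("\n\n\n\n", $text);
$result = array();

foreach ($subjects as $i => $subject) {
    // PREG_SET_ORDER yields one entry per field, with named groups as keys.
    if (preg_match_all($pattern, $subject, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $match) {
            $result[$i][$match['key']] = $match['value'];
        }
    }
}
echo '<pre>', print_r($result, true), '</pre>';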
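To try the approach without real files, here is a minimal self-contained sketch. The constant ARTICLE_PATTERN, the function parseArticles and the two-article sample are made up for illustration; the pattern itself is the one from the answer with the comments removed. The sketch assumes that each field name sits alone on its line, and it normalizes line endings to "\n" before splitting.

```php
<?php
// The field pattern from the answer, stored in a hypothetical constant.
const ARTICLE_PATTERN = <<<'LOD'
~
(?<fieldname> (?:HD|BY|WC|PD|SN|SC|PG|LA|CY|LP|TD|CO|IN|NS|RE|IPC|PUB|AN)$ ){0}
\G(?<key>\g<fieldname>) \s+
(?<value> .+ (?: \R{1,2} (?!\g<fieldname>) .+ )*+ )
(?:\R{1,3}|\z)
~xm
LOD;

// Hypothetical helper: parse the text of one file into a list of articles.
function parseArticles(string $text): array
{
    // Assumption: working on "\n" only keeps $ and the separator simple.
    $text = str_replace(["\r\n", "\r"], "\n", $text);

    $articles = [];
    foreach (explode("\n\n\n\n", $text) as $block) {
        if (preg_match_all(ARTICLE_PATTERN, $block, $matches, PREG_SET_ORDER)) {
            $fields = [];
            foreach ($matches as $match) {
                $fields[$match['key']] = $match['value'];
            }
            $articles[] = $fields;
        }
    }
    return $articles;
}

// Made-up sample: two short articles, field names alone on their lines.
$sample = "HD\r\nCorporate News\r\nBY\r\nBy James\r\n"
        . "LP\r\n\r\nPara one.\r\n\r\nPara two.\r\nAN\r\nDocument J1"
        . "\r\n\r\n\r\n\r\n"
        . "HD\r\nSecond\r\nAN\r\nDocument J2";

$articles = parseArticles($sample);
print_r($articles);
```

For the 100 files, one option is to call such a function on the contents of each path returned by glob() (read with file_get_contents()) and combine the per-file lists with array_merge().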

Pattern details:

The pattern is divided into two parts:

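In the definition part, I define a subpattern named fieldname so that it can be reused in the main pattern. Because of the {0} quantifier, this group matches nothing at the place where it is written; it only serves as a definition. The subpattern also uses the $ anchor to check that each field name ends its line.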
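As a small illustration of the definition technique on its own, the sketch below defines a group named tag (a shortened, made-up list of field names) and calls it with \g<tag>. It is only meant to show that the $ inside the definition rejects a line that merely begins with a field name.

```php
<?php
// (?<tag> ... ){0} defines the group without matching anything at that spot;
// \g<tag> then runs the group's pattern at the place where it is called.
$pattern = '~(?<tag> (?:HD|BY|AN)$ ){0} ^ \g<tag>~xm';

// "HDTV is not a tag" starts with HD, but HD does not end the line there.
preg_match_all($pattern, "HD\nHDTV is not a tag\nBY\n", $m);

print_r($m[0]); // the full matches: HD and BY
```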

The main pattern:

\G                        # forces the match to be contiguous to the
                          # previous match or to the start of the string (no gap)
(?<key> \g<fieldname> )   # a capturing group named "key" for the field name
\s+                       # one or more whitespace characters
(?<value>                 # open a capturing group named "value" for the
                          # field content
    .+                    # all characters except newlines, 1 or more times
    (?:                   # open a non-capturing group
        \R{1,2}           # one or two newlines to allow paragraphs (LP & TD)
        (?!\g<fieldname>) # but not followed by a field name (only a check)
        .+                # the rest of that line
    )*+                   # close the group and repeat it 0 or more times
                          # (possessive: no backtracking into it)
)                         # close the capturing group "value"
(?:\R{1,3}|\z)            # 1 to 3 newlines, or the end of the
                          # string (necessary for contiguous matches)
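The effect of \G is easier to see on a toy input. The following sketch (the input string is invented) compares the same pattern with and without the anchor: with \G, matching stops at the first gap instead of silently skipping text, which is why stray content between two fields ends the field list rather than being ignored.

```php
<?php
$line = '1,2,x,3';

// With \G each match must begin exactly where the previous one ended,
// so matching stops at the first character that does not fit ("x").
preg_match_all('~\G(\d+),?~', $line, $anchored);

// Without \G the engine skips over "x," and carries on.
preg_match_all('~(\d+),?~', $line, $free);

print_r($anchored[1]); // 1, 2
print_r($free[1]);     // 1, 2, 3
```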

Global modifiers:

x (Extended mode): whitespace and inline comments starting with # are ignored in the pattern.
m (Multiline mode): ^ matches at the start of each line and $ at the end of each line.
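A quick way to check what the two modifiers do is to run a tiny pattern against invented two-line inputs, as in the sketch below. The last two calls illustrate a point that matters for files with Windows line endings: with the default newline setting of a typical PHP build, $ in multiline mode matches before "\n" only, so "\r\n" input needs either normalizing or the (*ANYCRLF) verb at the start of the pattern.

```php
<?php
$lf   = "HD\nAN\nDocument";
$crlf = "HD\r\nAN\r\nDocument";

// Without m, ^ and $ refer to the whole string: no match.
$plain = preg_match('~^AN$~', $lf);

// With m, they also apply to each line: match.
$multiline = preg_match('~^AN$~m', $lf);

// With x, the pattern can be spaced out and commented.
$spaced = preg_match('~
    ^ AN $    # a field name alone on its line
~xm', $lf);

// Usually no match: the "\r" sits between AN and the "\n" that $ looks for.
$crlfDefault = preg_match('~^AN$~m', $crlf);

// (*ANYCRLF) makes \r, \n and \r\n all count as line breaks: match.
$crlfAny = preg_match('~(*ANYCRLF)^AN$~m', $crlf);

var_dump($plain, $multiline, $spaced, $crlfDefault, $crlfAny);
```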

edited Nov 13, 2021 at 19:26

community wiki

14 revs
Casimir et Hippolyte


7 Comments

AbsoluteƵERØ Over a year ago

You might want to link to the Heredoc string quoting. If someone pastes this into something like Dreamweaver it will have all sorts of errors. php.net/manual/en/…

Pr0no Over a year ago

Thanks! But running this for the example in the TS returns zero matches. $subjects contains one document (as given in the $text from the TS) but the pattern is not matching anything?

Casimir et Hippolyte Over a year ago

@Pr0no: I have tested with the Text Sample and it works well. I will post the data sample.

Pr0no Over a year ago

I have updated the TS to reflect your answer. I can't get it to work. Where am I going wrong?

Casimir et Hippolyte Over a year ago

@Pr0no: the first delimiter ~ must be on the line immediately after 'LOD' (no spaces or tabs before it). Since your text file uses \r\n for newlines, you must replace \n with \r\n in the pattern; see the edit.
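A further possible cause of zero matches: the pattern requires each field name to end its line, so a layout where the field name and its value share a line (as in "HD Corporate News: ...") cannot match. For that layout, one alternative is a plain line-based parser. The sketch below is not from the answer; the function name parseArticleLines and the sample block are made up, and it assumes that no body line starts with an upper-case field name followed by a space.

```php
<?php
// Hypothetical line-based parser for one article block.
function parseArticleLines(string $block): array
{
    $tags = ['HD', 'BY', 'WC', 'PD', 'SN', 'SC', 'PG', 'LA', 'CY', 'LP', 'TD',
             'CO', 'IN', 'NS', 'RE', 'IPC', 'PUB', 'AN'];
    // A field name at the start of a line, optionally followed by a value.
    $tagLine = '~^(' . implode('|', $tags) . ')(?:[ \t]+(.*))?$~';

    $fields  = [];
    $current = null;
    // Split on any of the common line endings.
    foreach (preg_split('~\r\n|\r|\n~', $block) as $line) {
        if (preg_match($tagLine, rtrim($line), $m)) {
            $current = $m[1];
            $fields[$current] = $m[2] ?? '';      // value on the same line, if any
        } elseif ($current !== null) {
            $fields[$current] .= "\n" . $line;    // continuation of the current field
        }
    }
    // Drop the blank lines that surround each value.
    return array_map('trim', $fields);
}

$block = "HD Corporate News \r\nWC 421 words\r\nLP \r\n\r\nFirst paragraph."
       . "\r\n\r\nSecond paragraph.\r\n\r\nAN \r\nDocument J1";

$fields = parseArticleLines($block);
print_r($fields);
```

Because it works line by line, this variant accepts a field name alone on its line as well as a field name followed by its value, at the cost of the stricter contiguity check that \G provides in the regex approach.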
