2

I have a question related to using regex to pull out data from a text file. I have a text file in the following format:

REPORTING-OWNER:    

    OWNER DATA: 
        COMPANY CONFORMED NAME:         DOE JOHN
        CENTRAL INDEX KEY:          99999999999

    FILING VALUES:
        FORM TYPE:      4
        SEC ACT:        1934 Act
        SEC FILE NUMBER:    811-00248
        FILM NUMBER:        11530052

    MAIL ADDRESS:   
        STREET 1:       7 ST PAUL STREET
        STREET 2:       STE 1140
        CITY:           BALTIMORE
        STATE:          MD
        ZIP:            21202

ISSUER:     

    COMPANY DATA:   
        COMPANY CONFORMED NAME:         ACME INC
        CENTRAL INDEX KEY:          0000002230
        IRS NUMBER:             134912740
        STATE OF INCORPORATION:         MD
        FISCAL YEAR END:            1231

    BUSINESS ADDRESS:   
        STREET 1:       SEVEN ST PAUL ST STE 1140
        CITY:           BALTIMORE
        STATE:          MD
        ZIP:            21202
        BUSINESS PHONE:     4107525900

    MAIL ADDRESS:   
        STREET 1:       7 ST PAUL STREET SUITE 1140
        CITY:           BALTIMORE
        STATE:          MD
        ZIP:            21202

I want to save the owner's name (John Doe) and identifier (99999999999) and the company's name (ACME Inc) and identfier (0000002230) as separate variables. However, as you can see, the variable names (CENTRAL INDEX KEY and COMPANY CONFORMED NAME) are exactly the same for both pieces of information.

I've used the following code to extract the owner's information, but I can't figure out how to extract the data for the company. (Note: I read the entire text file into $data).

if($data=~m/^\s*CENTRAL\s*INDEX\s*KEY:\s*(\d*)/m){$cik=$1;}
if($data=~m/^\s*COMPANY\s*CONFORMED\s*NAME:\s*(.*$)/m){$name=$1;}

Any idea as to how I can extract the information for both the owner and the company?

Thanks!

5 Answers 5

3

There is a big difference between doing it quick and dirty with regexes (maintenance nightmare), or doing it right.

As it happens, the file you gave looks very much like YAML.

use YAML;
my $data = Load(...);
say $data->{"REPORTING-OWNER"}->{"OWNER DATA"}->{"COMPANY CONFORMED NAME"};
say $data->{"ISSUER"}->{"COMPANY DATA"}->{"COMPANY CONFORMED NAME"};

Prints:

DOE JOHN
ACME INC

Isn't that cool? All in a few lines of safe and maintainable code ☺

Sign up to request clarification or add additional context in comments.

1 Comment

Thanks to everyone for the comments. I tried running the YAML code and it told me that I have inconsistent indentation...so apparently I need to go back and check and make sure that my files are formatted correctly.
0
my ($ownname, $ownkey, $comname, $comkey) = $data =~ /\bOWNER DATA:\s+COMPANY CONFORMED NAME:\s+([^\n]+)\s*CENTRAL INDEX KEY:\s+(\d+).*\bCOMPANY DATA:\s+COMPANY CONFORMED NAME:\s+([^\n]+)\s*CENTRAL INDEX KEY:\s+(\d+)/ms

If you're reading this file on a UNIX operating system but it was generated on Windows, then line endings will be indicated by the character pair \r\n instead of just \n, and in this case you should do

$data =~ tr/\r//d;

first to get rid of these \r characters and prevent them from finding their way into $ownname and $comname.

Comments

0

Select both bits of information at the same time so that you know that you're getting the CENTRAL INDEX KEY associated with either the owner or the company.

($name, $cik) = $data =~ /COMPANY\s+CONFORMED\s+NAME:\s+(.+)$\s+CENTRAL\s+INDEX\s+KEY:\s+(.*)$/m;

Comments

0

Instead of trying to match elements in the string, split it into lines, and parse properly into data structure that will let such searches be made easily, like:

$data->{"REPORTING-OWNER"}->{"OWNER DATA"}->{"COMPANY CONFORMED NAME"}

That should be relatively easy to do.

4 Comments

but completely unnecessary.
Regexp can do it. Sure. But that doesn't mean it's good idea.
@depesz I'm with you on this one. Using regexes here is idiotic. YAML comes to the rescue, and creates exactly the data structure you described!
Oops. Added my code, and later on realized that this is exactly the same thing @amon did in his answer :)
-1

Search for OWNER DATA: read one more line, split on : and take the last field. Same for COMPANY DATA: header (sortof), on so on

1 Comment

why not just extract everything from $data for either the owner or the company in one regular expression?

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.