
Below is my test string:

Object: TLE-234DSDSDS324-234SDF324ER
  Page location: SDEWRSD3242SD-234/324/234 (1)
    org-chart           Lorem ipsum dolor    consectetur adipiscing          # Colorado
    234DSDSDS324-32-4/2/7-page2 (2) loc log  Apr 18 21:42:49 2017           1
      Page information: 3.32.232.212.23, Error: fatal, Technique: color
        Comments: Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. 
      Validation status: Lorem ipsums dolors sits amets, consectetur adipiscing elit
       Positive control-export: Validated
  Page location: SDEWRSD3242SD-SDF/234/324 (5)
    org-chart           Lorem ipsum dolor    consectetur adipiscin          # Arizona
    234DSDSDS324-23-11/1/0-page1 (1) loc log Apr 18 21:42:49 2017           1
      Page information: 3.32.232.212.23, Error: log, Technique: color
        Comments: Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
      Validation status: Lorem ipsums dolors sits amets, consectetur adipiscing elit
       Positive control-export: Validated

I need to capture the strings that follow the labels "Page location: ", "Object: " and "Comments: ".

For example:

Object: TLE-234DSDSDS324-234SDF324ER - Group 1

Page location: SDEWRSD3242SD-234/324/234 (1) - Group 2

Page location: SDEWRSD3242SD-SDF/234/324 (5) - Group 3

Comments: Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. - Group 4

Comments: Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. - Group 5

Here is my regex URL.

I am able to capture the strings, but the regex fails to match if any one of the strings is repeated.

  • You're having problems if, e.g., Page location occurs multiple times, is this right? Commented May 9, 2017 at 21:59
  • Yes. @Jan.. Page location and Comments Commented May 9, 2017 at 22:06
  • Something like this: regex101.com/r/t15dD8/8 ? Commented May 9, 2017 at 22:06
  • Exactly, but it is not matching if I add one more Page location and Comments in the test string? Commented May 9, 2017 at 22:10
  • Is all this in one string (or is it in separate lines) -- or, how do you get this data into the program? Is "Page location:" unique, so that you always need what follows it? How far after "Page location" do you need to capture -- to the first newline? This is all shown "inside" of one "Object" -- are there multiple such sections in your string/file? Commented May 9, 2017 at 22:36

1 Answer


(See comments below the question for the problem description.)

The data is in a multi-line string, with multiple sections starting with Object:. Within each there are multiple lines starting with the phrases Page location: and Comments:. The rest of the line after each of these needs to be captured, with everything organized by Object.

Instead of attempting a tortured multi-line "single" regex, break the string into lines and process section by section. This way the problem becomes a very simple one.

The results are stored in an array of hashrefs; each hashref has the phrases shown above as keys. Since a phrase can appear more than once per section, each value is an arrayref holding what follows that phrase on each of its lines.

use warnings;
use strict;
use feature 'say';

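my $input_string = '...';   # the multi-line test string from the question
my @lines = split /\n/, $input_string;

# Labels whose line remainder we want to capture
my $patt = qr/Object|Page location|Comments/;

my @sections;   # one hashref per Object section
for (@lines) 
{
    # $1 is the label, $2 is the rest of the line after "label:"
    next if not /^\s*($patt):\s*(.*)/;

    # An Object line starts a new section
    push @sections, {}  if $1 eq 'Object';

    # Append to this label's arrayref in the current (last) section
    push @{ $sections[-1]->{$1} }, $2;
}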

foreach my $sec (@sections) {
    foreach my $key (sort keys %$sec) {
        say "$key:";
        say "\t$_" for @{$sec->{$key}};
    }   
}

With the input string copied (suppressed above for brevity), the output is

Comments:
        Lorem ipsum dolor sit amet,  [...]
        Lorem ipsum dolor sit amet,  [...]
Object:
        TLE-234DSDSDS324-234SDF324ER
Page location:
        SDEWRSD3242SD-234/324/234 (1)
        SDEWRSD3242SD-SDF/234/324 (5)

A few comments.

When an Object line is found, we add a new hashref to @sections. The matched phrase is then used as a key, and the rest of its line is pushed onto that key's arrayref value. This is done for the current (so last) element of @sections.
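
To illustrate the grouping, here is a minimal sketch on a made-up input with two Object sections (the names A-1, B-2 and all values are hypothetical, not from the question). It also skips a labeled line that appears before any Object, a case the loop above does not guard against; that guard is an assumption about the desired behavior.

```perl
use warnings;
use strict;
use feature 'say';

# Hypothetical input with two Object sections (not the question's data)
my $input_string = <<'END';
Object: A-1
  Page location: P-1 (1)
      Comments: first comment
  Page location: P-2 (5)
Object: B-2
  Page location: P-3 (2)
END

my $patt = qr/Object|Page location|Comments/;

my @sections;
for (split /\n/, $input_string) {
    next if not /^\s*($patt):\s*(.*)/;
    my ($label, $rest) = ($1, $2);

    # An Object line opens a new section
    push @sections, {} if $label eq 'Object';
    next if not @sections;    # skip labeled lines seen before any Object

    push @{ $sections[-1]{$label} }, $rest;
}

# Report how many sections were found and the page locations in each
say scalar(@sections), " sections";
say "$_->{Object}[0]: ", scalar @{ $_->{'Page location'} }, " page location(s)"
    for @sections;
```

Each Object gets its own hashref, so the two Page location lines under A-1 stay separate from the one under B-2.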

This adds an empty string if a pattern had nothing following it. To disallow that, add next if not length $2; right after the match (a plain next if not $2; would also skip a value of 0).
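
A minimal sketch of such a check, on a made-up three-line input. It tests length rather than truth so that a value of 0 is kept; this is a suggested variation and not part of the code above.

```perl
use warnings;
use strict;
use feature 'say';

# Hypothetical input: one label with nothing after it, one with the value 0
my @lines = ('Object: X-1', '  Comments:', '  Comments: 0');

my $patt = qr/Object|Page location|Comments/;

my @sections;
for (@lines) {
    next if not /^\s*($patt):\s*(.*)/;
    next if not length $2;    # skip empty values but keep "0"

    push @sections, {} if $1 eq 'Object';
    push @{ $sections[-1]{$1} }, $2;
}

say "Comments kept: ", join ', ', map { "'$_'" } @{ $sections[0]{Comments} };
```

Note that with this check an Object line with an empty value is skipped as well, so it would not open a new section.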

Note. An easy and common way to print complex data structures is via the core module Data::Dumper. But also see Data::Dump for a much more compact printout.
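
For instance, a minimal sketch that dumps a hand-built structure of the same shape as @sections (the Comments values are shortened placeholders). The $Data::Dumper::Indent and $Data::Dumper::Sortkeys settings are optional; they only make the printout more compact and its key order stable.

```perl
use warnings;
use strict;
use Data::Dumper;

# Hand-built stand-in for @sections, with shortened placeholder comments
my @sections = (
    {
        'Object'        => [ 'TLE-234DSDSDS324-234SDF324ER' ],
        'Page location' => [ 'SDEWRSD3242SD-234/324/234 (1)',
                             'SDEWRSD3242SD-SDF/234/324 (5)' ],
        'Comments'      => [ 'Lorem ipsum ...', 'Lorem ipsum ...' ],
    },
);

$Data::Dumper::Indent   = 1;   # fixed, compact indentation
$Data::Dumper::Sortkeys = 1;   # print hash keys in sorted order

my $dump = Dumper(\@sections);
print $dump;
```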


2 Comments

I'm guessing here, but does OP need to restart his collection of data for each new object? In that case I'd use an array of hashes; start with an empty hash, capture into it until a new 'Object:' is seen, at which point a new empty anonymous hash is pushed onto the array and data is captured into that. Totally agree that a loop is a far better solution here than trying to compress the logic into a single regex!
@JoeMcMahon Yes, this needs to be adjusted if multiple Object sections exist and need be distinguished. I asked a question, in comments and in the answer itself, and am waiting :). The approach with one regex assumes that all data is available in a string, which isn't clear. My question was answered that it's "separate lines" and then reading a file is common ... I am waiting on that clarification as well.
