18

Consider the following strings:

1) Scheme ID: abc-456-hu5t10 (High priority) *

2) Scheme ID: frt-78f-hj542w (Balanced)

3) Scheme ID: 23f-f974-nm54w (super formula run) *

and so on in the above format. The parts that change across the strings are the ID, the text in parentheses, and the optional trailing *.

Imagine I have many strings in the format shown above. I want to pick 3 substrings from each of them:

  • 1st substring: the alphanumeric value (in the example above, "abc-456-hu5t10")
  • 2nd substring: the words in parentheses (in the example above, "High priority")
  • 3rd substring: the * (if a * is present at the end of the string; otherwise leave it out)

How do I pick these 3 substrings from each string shown above? I know it can be done using regular expressions in Perl... Can you help with this?

1
  • Can the string in parentheses itself contain nested parentheses? Commented Sep 18, 2009 at 12:02

7 Answers

34

You could do something like this:

my $data = <<END;
1) Scheme ID: abc-456-hu5t10 (High priority) *
2) Scheme ID: frt-78f-hj542w (Balanced)
3) Scheme ID: 23f-f974-nm54w (super formula run) *
END

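foreach my $line (split /\n/, $data) {
  # Skip lines that don't match the expected format
  $line =~ /Scheme ID: ([a-z0-9-]+)\s+\(([^)]+)\)\s*(\*)?/ or next;
  # $3 is undef when there is no trailing *, so default it to ''
  my ($id, $word, $star) = ($1, $2, defined $3 ? $3 : '');
  print "$id $word $star\n";
}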

The key thing is the regular expression:

Scheme ID: ([a-z0-9-]+)\s+\(([^)]+)\)\s*(\*)?

Which breaks up as follows.

The fixed string "Scheme ID: ":

Scheme ID: 

Followed by one or more of the characters a-z, 0-9 or -. We use parentheses to capture it as $1:

([a-z0-9-]+)

Followed by one or more whitespace characters:

\s+

Followed by an opening parenthesis (which we escape), then one or more characters that aren't a closing parenthesis, and then a closing parenthesis (also escaped). We use unescaped parentheses to capture the words as $2:

\(([^)]+)\)

Followed by optional whitespace and maybe a *, captured as $3:

\s*(\*)?
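
The pieces above can be put back together as one commented pattern. The following is only a sketch: the /x layout and the helper name parse_scheme are assumptions for illustration and not part of the answer. It returns an empty string for the star when there is none.

```perl
use strict;
use warnings;

# The same pattern, written with /x so each piece can carry a comment.
my $scheme_re = qr{
    Scheme [ ] ID: [ ]   # the fixed string "Scheme ID: "
    ([a-z0-9-]+)         # capture 1: the ID
    \s+                  # one or more whitespace characters
    \( ([^)]+) \)        # capture 2: the text between the parentheses
    \s* (\*)?            # capture 3: optional trailing star
}x;

# Hypothetical helper: returns (id, word, star), or an empty list on no match.
sub parse_scheme {
    my ($line) = @_;
    my ($id, $word, $star) = $line =~ $scheme_re
        or return;
    # The third capture is undef when there is no star; normalise it to ''.
    return ($id, $word, defined $star ? $star : '');
}

for my $line ('1) Scheme ID: abc-456-hu5t10 (High priority) *',
              '2) Scheme ID: frt-78f-hj542w (Balanced)') {
    my ($id, $word, $star) = parse_scheme($line) or next;
    print "$id | $word | $star\n";
}
```

Normalising the star to an empty string keeps the print free of "uninitialized value" warnings when warnings are enabled.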

5

You could use a regular expression such as the following:

/([-a-z0-9]+)\s*\((.*?)\)\s*(\*)?/

So for example:

$s = "abc-456-hu5t10 (High priority) *";
$s =~ /([-a-z0-9]+)\s*\((.*?)\)\s*(\*)?/;
print "$1\n$2\n$3\n";

prints

abc-456-hu5t10
High priority
*

3
(\S*)\s*\((.*?)\)\s*(\*?)


(\S*)    picks up anything which is NOT whitespace
\s*      0 or more whitespace characters
\(       a literal open parenthesis
(.*?)    anything, non-greedy so stops on first occurrence of...
\)       a literal close parenthesis
\s*      0 or more whitespace characters
(\*?)    0 or 1 occurrences of literal *
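
A small sketch of this pattern in use, assuming input lines like the ones in the question (the helper name fields is made up for illustration). Because the star sits inside the group as (\*?), the third capture is an empty string rather than undef when no star is present:

```perl
use strict;
use warnings;

my $re = qr/(\S*)\s*\((.*?)\)\s*(\*?)/;

# Hypothetical helper: returns the three captures, or an empty list on no match.
sub fields {
    my ($line) = @_;
    my @f = $line =~ $re;
    return @f;
}

for my $line ('Scheme ID: abc-456-hu5t10 (High priority) *',
              'Scheme ID: frt-78f-hj542w (Balanced)') {
    my ($id, $word, $star) = fields($line);
    # $star is always defined here: '*' or ''.
    printf "%s | %s | star=%s\n", $id, $word, length($star) ? 'yes' : 'no';
}
```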

1 Comment

\(([^)]*)\) would be better than \((.*?)\), as it's guaranteed to stop at the first ). Non-greedy quantifiers can cause heavy backtracking, which kills performance. (Unlikely in this case, admittedly, but avoiding them when they're not needed is still a good habit to cultivate.) The negated character class is also a clearer statement of your intent - you're looking for "any number of non-) characters", not "the smallest number of any character at all, followed by a ), which makes the expression as a whole match".
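
To illustrate the difference in intent, here is a sketch on a made-up input (not from the question) that has two parenthesised groups. Both patterns require a trailing *, which is an assumption added so the difference shows up:

```perl
use strict;
use warnings;

# Hypothetical input with two parenthesised groups before the star.
my $input = 'Scheme ID: abc-456 (draft) (High priority) *';

# Non-greedy: .*? may grow past a ) if that is what lets the whole pattern match.
my ($lazy) = $input =~ /\((.*?)\)\s*\*/;

# Negated class: the capture can never contain a ).
my ($negated) = $input =~ /\(([^)]*)\)\s*\*/;

print "lazy:    $lazy\n";      # prints "lazy:    draft) (High priority"
print "negated: $negated\n";   # prints "negated: High priority"
```
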
2

Well, a one liner here:

perl -lne 'm|Scheme ID:\s+(.*?)\s+\((.*?)\)\s?(\*)?|g&&print "$1:$2:$3"' file.txt

Expanded to a simple script to explain things a bit better:

#!/usr/bin/perl -ln              

#-l : print newline after every print
#-n : apply script body to stdin or files listed at commandline, don't print $_

use strict; #always do this.     

my $regex = qr{  # precompile regex                                 
  Scheme\ ID:      # fixed label; not anchored, so it can match anywhere in the line
  \s+              # 1 or more whitespace                             
  (.*?)            # Non greedy match of all characters up to         
  \s+              # 1 or more whitespace                             
  \(               # parenthesis literal                              
    (.*?)            # non-greedy match to the next                     
  \)               # closing literal parenthesis                      
  \s*              # 0 or more whitespace (trailing * is optional)    
  (\*)?            # 0 or 1 literal *s                                
}x;  #x switch allows whitespace in regex to allow documentation.   

#values trapped in $1 $2 $3, so do whatever you need to:            
#Perl lets you use almost any character as a delimiter, I like pipes because
#they reduce the amount of escaping when using file paths           
m|$regex| && print "$1 : $2 : $3";

#alternatively if(m|$regex|) {doOne($1); doTwo($2) ... }     

Though if it were anything other than formatting, I would implement a main loop to handle files and flesh out the body of the script rather than relying on the command-line switches for the looping.
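
A minimal sketch of what such an explicit main loop might look like, assuming the files are named on the command line. The helper name extract_schemes is hypothetical, and the regex is the one from the script above written on a single line:

```perl
use strict;
use warnings;

my $regex = qr{Scheme\ ID:\s+(.*?)\s+\((.*?)\)\s*(\*)?}x;

# Hypothetical helper: read one filehandle, return "id : word : star" strings.
sub extract_schemes {
    my ($fh) = @_;
    my @rows;
    while (my $line = <$fh>) {
        chomp $line;
        next unless $line =~ $regex;
        my ($id, $word, $star) = ($1, $2, $3);
        $star = '' unless defined $star;   # no trailing * on this line
        push @rows, "$id : $word : $star";
    }
    return @rows;
}

# Explicit main loop over the files named on the command line.
for my $file (@ARGV) {
    open my $fh, '<', $file or die "Cannot open $file: $!";
    print "$_\n" for extract_schemes($fh);
    close $fh;
}
```

Keeping the matching in a sub that takes a filehandle makes it possible to feed it a file, STDIN, or an in-memory string without changing the parsing code.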

2

Long time no Perl

while(<STDIN>) {
    next unless /:\s*(\S+)\s+\(([^\)]+)\)\s*(\*?)/;
    print "|$1|$2|$3|\n";
}

2

This just requires a small change to my last answer:

my ($guid, $scheme, $star) = $line =~ m{
    Scheme [ ] ID: [ ]
    ([a-zA-Z0-9-]+)          #capture the ID
    [ ]
    \(  (.+)  \)             #capture the scheme 
    (?:
        [ ]
        ([*])                #capture the star 
    )?                       #if it exists
}x;

0

String 1:

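my ($s1) = $input =~ /ID:\s*(\S+)/;    # the ID after "Scheme ID:"

String 2:

my ($s2) = $input =~ /\(([^)]*)\)/;    # text inside the parentheses

String 3:

my ($s3) = $input =~ /(\*?)\s*$/;      # trailing *, or '' if absent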
