I am trying to use a regex to parse out the key-value pairs of command-line switches. Here's what I've got so far:

(?<=(^-{1,2}| -{1,2}|^/| /))(?<name>[\w]+)[ :"]*(?<value>[\w.?=&+ :/|\\]*)(?=[ "]|$)

It seems to parse everything properly... almost. If there are hyphens in the value, it craps out on the match. How do I tweak this to work on all the test examples below?

test examples (all valid):

-s  -i:C:\Users\Fozzie\Workspace\TheProject\TheProject-Stack-1_5\db\DB Scripts\ -h:local:host -d:theDB
-o:"C:\temp\db\" -s -r -host:localhost --d theDB
-s  -i:"C:\Users\Fozzie\Workspace\TheProject\TheProject-Stack-1_5\db\DB -Scripts\" -h:localhost -d:theDB
-s  -d http://www.theproject.com -h:localhost -d:theDB
-i:"C:\Users\Fozzie\Workspace\TheProject\TheProject_Stack_1_5\db\DB Scripts\" --h:localhost -d:theDB
-h:localhost -i:"C:\Users\Fozzie\Workspace\TheProject\TheProject-Stack_1_5\db\DB Scripts\"  -d:theDB
--d theDB   -o:"C:\temp\db\" -host=local-host     -r

The regex fails when the value part is something like

"C:\Users\Fozzie\Workspace\TheProject\TheProject-Stack-1_5\db\DB Scripts\"

or "local-host" due to the hyphens therein. It is seen as the start of a new switch.
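
The hyphen problem can be reproduced by trying the value character class on its own. This is only a minimal sketch in Python rather than .NET (Python's re module rejects the variable-width look-behind, so the full regex is not run here). The class is copied from the regex above; `longest_prefix` and the second, hyphen-enabled class are made up for this illustration.

```python
import re

# Value character class copied from the regex in the question.
VALUE_CLASS = r'[\w.?=&+ :/|\\]*'
# Same class with a hyphen appended (illustrative tweak only, not a complete fix).
VALUE_CLASS_WITH_HYPHEN = r'[\w.?=&+ :/|\\-]*'

def longest_prefix(char_class, text):
    """Return the longest prefix of text built only from characters in char_class."""
    return re.match(char_class, text).group(0)

print(longest_prefix(VALUE_CLASS, 'local-host'))              # local
print(longest_prefix(VALUE_CLASS_WITH_HYPHEN, 'local-host'))  # local-host
```

Appending the hyphen on its own is probably not enough: the class also admits spaces, so a greedy value could then run on through the next switch. That is why the answers restructure the value part instead.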

PS: I don't want to use a canned options library like getops. I'm interested in getting the regex right.

Thanks.

UPDATE: Sorry for the missing detail: this is a .NET regex.

  • Would you like to add which programming language you use and what the regex means? In particular, what are the ^/| / alternatives for, and what does the ? in (?<name>[\w]+) mean? Commented Oct 30, 2012 at 12:06
  • It appears that you are dealing with a string containing the space-separated concatenation of separate command-line arguments, rather than already split command line arguments, which is the more usual problem. In the first line of the sample data, where exactly does the argument associated with the -i: end, and how do you know that? It looks to me like Scripts\ is a loose word on its own; is that correct? Or does the argument continue until the next blank followed by dash (or end of string)? Commented Oct 30, 2012 at 12:59
  • This isn't a Perl regex, or at least the current perlre doesn't recognize variable-width look-behind (the material in the first set of balanced parentheses: (?<=(^-{1,2}| -{1,2}|^/| /))). So, which sub-species of RE are you targeting? Which language supports the variable-width look-behind? (Also, none of your sample data uses the /-as-option-introducer notation that you seem to be supporting.) Commented Oct 30, 2012 at 13:21
  • @jonathan: My apologies on that necessary detail. I added it above. Commented Oct 30, 2012 at 22:15
  • I changed the regex; it now accepts file paths. Take another look. Commented Oct 30, 2012 at 23:10

2 Answers

.NET solution — speculative

This suggested .NET solution is just 'suggested'; I don't do .NET and have no way of testing it on any of my machines (is there a regex-test web site for .NET?). I've taken the working Perl solution, removed the <mark> and <pad> parts that you're not worried about, and the comments, and flattened it all onto one line on the assumption that .NET doesn't have an option for legibility analogous to Perl's x option. You can still find 5 sets of parentheses corresponding to the 5 parts of the Perl regex. I'm assuming that (?:...) is a non-capturing group.

(?:-{1,2}|/)(?<name>\w+)(?:[=:]?|\s+)(?<value>[^-\s"][^"]*?|"[^"]*")?(?=\s+[-/]|$)

I also assume that .NET provides some mechanism analogous to Perl's g modifier that allows you to scan the string on a second (or subsequent) pass where it left off on the previous pass. Or that you can somehow determine where the end of the match was and resume the scan from there.
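
For illustration only, here is how the one-line pattern can be stepped through a string in Python, where `re.finditer` plays the role of Perl's g modifier (in .NET, `Regex.Matches` serves the same purpose). This is a sketch under assumptions, not a tested .NET solution: the pattern is ported by changing `(?<name>...)` to Python's `(?P<name>...)`, and `SWITCH_RX` and `parse_switches` are hypothetical names.

```python
import re

# Python port of the one-line pattern above; only the named-group syntax differs.
SWITCH_RX = re.compile(
    r'(?:-{1,2}|/)(?P<name>\w+)(?:[=:]?|\s+)'
    r'(?P<value>[^-\s"][^"]*?|"[^"]*")?(?=\s+[-/]|$)'
)

def parse_switches(line):
    """Scan the line, resuming after each match, and collect (name, value) pairs."""
    return [(m.group('name'), m.group('value') or '')
            for m in SWITCH_RX.finditer(line)]

# Second sample line from the question.
for name, value in parse_switches(r'-o:"C:\temp\db\" -s -r -host:localhost --d theDB'):
    print(f'{name!r} -> {value!r}')
```

Switches without a value come back with an empty string here; quoted values keep their surrounding double quotes, as in the pattern itself.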

Perl solution — validated

This is as good as I've managed to come up with using Perl regexes (tested with Perl 5.16.0 on Mac OS X 10.7.5).

#!/usr/bin/env perl
use strict;
use warnings;

# Original regex split into 5 sections:
# C1          (?<=(^-{1,2}|\ -{1,2}|^/|\ /))
# C2          (?<name>[\w]+)
# C3          [ :"]*
# C4          (?<value>[\w.?=&+ :/|\\]*)
# C5          (?=[ "]|$)

my $rx = qr%(?<mark>  -{1,2}|/ )                        (?# Was C1)
            (?<name>  \w+ )                             (?# Was C2)
            (?<pad>   (?: [=:]?|\s+ ))                  (?# Was C3)
            (?<value> (?: [^-\s"][^"]*? | "[^"]*" ))?   (?# Was C4)
            (?=\s+[-/]|$)                               (?# Was C5)
           %x;

while (my $line = <DATA>)
{
    chomp $line;
    print "\nLine: $line\n";
    while ($line =~ m/$rx/g)
    {
        my($mark, $name, $pad, $value) = ($1, $2, $3, $4 // "");
        print "Found: mark $mark name <<$name>> pad <<$pad>> value <<$value>>\n";
    }
}

__DATA__
-s  -i:C:\Users\Fozzie\Workspace\TheProject\TheProject-Stack-1_5\db\DB Scripts\ -h:local:host -d:theDB
-o:"C:\temp\db\" -s -r -host:localhost --d theDB
-s  -i:"C:\Users\Fozzie\Workspace\TheProject\TheProject-Stack-1_5\db\DB -Scripts\" -h:localhost -d:theDB
-s  -d http://www.theproject.com -h:localhost -d:theDB
-i:"C:\Users\Fozzie\Workspace\TheProject\TheProject_Stack_1_5\db\DB Scripts\" --h:localhost -d:theDB
-h:localhost -i:"C:\Users\Fozzie\Workspace\TheProject\TheProject-Stack_1_5\db\DB Scripts\"  -d:theDB
--d theDB   -o:"C:\temp\db\" -host=local-host     -r
-s -i:C:\Users\Fozzie\Workspace\TheProject\TheProject-Stack-1_5\db\DB Scripts\ -h:local:host -d:theDB
/d theDB   /o:"C:\temp\db\" /host=local-host     /r
/d theDB /o:"C:\temp\db\" /host=local-host /r /t
-s:C:\Users\Fozzie\Workspace\TheProject\TheProject-Stack-1_5\db\DB Scripts\ -h:local:host -d:theDB

The bulk of the script is not very interesting. The outer while loop reads the data section of the file (which is the material after the marker __DATA__) one line at a time, prints it for validation, then repeatedly runs the regex on the line to find the components (the marker, the name, the padding, and the value), printing those out. The bulk of the data is what was provided in the question (thank you!). The last three lines of the data are extra compared to what was originally provided.

All the excitement is in the regex. I've used Perl's /x modifier to allow white space in the regex for readability. This means that white space is not significant unless preceded by a backslash or enclosed in square brackets (and there is no significant white space in this specimen). I've used the (?<name> ...) notation to identify the pieces as in the original, though the names could be omitted since they aren't used. The (?# Was Cn) parts are pure comment.

  1. The mark is either one or two dashes or a slash; --? would be a shorter way to write the -{1,2} part.
  2. The name is a string of word characters (alphanumerics and underscore); this does not attempt to enforce 'first character may not be a digit'.
  3. The pad separates the name from the value. It can be empty, a single equals or colon, or a string of white space. The inner (?: ...) is a non-capturing grouping operator.
  4. The value is optional (the -s option in the first position of the first line of the sample data doesn't have a value). It consists of: either a string starting with something other than a dash, double quote or white space, followed by a non-greedy string of non-quotes; or a double quote, a string of non-quotes, and another double quote.
  5. The trailing zero-width context (C5) is either one or more white space characters followed by a dash or slash, or EOS. Because the value pattern is non-greedy, it stops at the first point where this context matches, and the greedy \s+ in the context accounts for the white space after an option value without consuming it.
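
Points 4 and 5 work together: the unquoted value is matched lazily, and the trailing context decides where it stops. The following is a minimal Python sketch of that interaction, using `re.VERBOSE` in place of Perl's /x. It is a port, not the validated Perl code; `build_rx`, `pairs`, the `value_tail` parameter and the greedy variant are hypothetical and exist only for contrast.

```python
import re

def build_rx(value_tail):
    """Assemble the five-part pattern; value_tail quantifies the rest of an unquoted value."""
    return re.compile(rf'''
        (?P<mark>  -{{1,2}} | / )                          # 1. one or two dashes, or a slash
        (?P<name>  \w+ )                                   # 2. switch name
        (?P<pad>   (?: [=:]? | \s+ ) )                     # 3. separator, possibly empty
        (?P<value> (?: [^-\s"]{value_tail} | "[^"]*" ) )?  # 4. optional value
        (?= \s+[-/] | $ )                                  # 5. zero-width trailing context
    ''', re.VERBOSE)

def pairs(rx, line):
    """Return (name, value) for every match found while stepping through the line."""
    return [(m.group('name'), m.group('value') or '') for m in rx.finditer(line)]

LAZY = build_rx(r'[^"]*?')    # as in the pattern above
GREEDY = build_rx(r'[^"]*')   # hypothetical variant, for contrast only

line = '-h:localhost -d:theDB'
print(pairs(LAZY, line))    # [('h', 'localhost'), ('d', 'theDB')]
print(pairs(GREEDY, line))  # [('h', 'localhost -d:theDB')]
```

With the greedy tail the first value runs to the end of the line, where the EOS branch of the trailing context is satisfied, so the second switch is swallowed. The lazy tail stops at the first white space that is followed by a dash or slash.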

The output is:

Line: -s  -i:C:\Users\Fozzie\Workspace\TheProject\TheProject-Stack-1_5\db\DB Scripts\ -h:local:host -d:theDB
Found: mark - name <<s>> pad << >> value <<>>
Found: mark - name <<i>> pad <<:>> value <<C:\Users\Fozzie\Workspace\TheProject\TheProject-Stack-1_5\db\DB Scripts\>>
Found: mark - name <<h>> pad <<:>> value <<local:host>>
Found: mark - name <<d>> pad <<:>> value <<theDB>>

Line: -o:"C:\temp\db\" -s -r -host:localhost --d theDB
Found: mark - name <<o>> pad <<:>> value <<"C:\temp\db\">>
Found: mark - name <<s>> pad <<>> value <<>>
Found: mark - name <<r>> pad <<>> value <<>>
Found: mark - name <<host>> pad <<:>> value <<localhost>>
Found: mark -- name <<d>> pad << >> value <<theDB>>

Line: -s  -i:"C:\Users\Fozzie\Workspace\TheProject\TheProject-Stack-1_5\db\DB -Scripts\" -h:localhost -d:theDB
Found: mark - name <<s>> pad << >> value <<>>
Found: mark - name <<i>> pad <<:>> value <<"C:\Users\Fozzie\Workspace\TheProject\TheProject-Stack-1_5\db\DB -Scripts\">>
Found: mark - name <<h>> pad <<:>> value <<localhost>>
Found: mark - name <<d>> pad <<:>> value <<theDB>>

Line: -s  -d http://www.theproject.com -h:localhost -d:theDB
Found: mark - name <<s>> pad << >> value <<>>
Found: mark - name <<d>> pad << >> value <<http://www.theproject.com>>
Found: mark - name <<h>> pad <<:>> value <<localhost>>
Found: mark - name <<d>> pad <<:>> value <<theDB>>

Line: -i:"C:\Users\Fozzie\Workspace\TheProject\TheProject_Stack_1_5\db\DB Scripts\" --h:localhost -d:theDB
Found: mark - name <<i>> pad <<:>> value <<"C:\Users\Fozzie\Workspace\TheProject\TheProject_Stack_1_5\db\DB Scripts\">>
Found: mark -- name <<h>> pad <<:>> value <<localhost>>
Found: mark - name <<d>> pad <<:>> value <<theDB>>

Line: -h:localhost -i:"C:\Users\Fozzie\Workspace\TheProject\TheProject-Stack_1_5\db\DB Scripts\"  -d:theDB
Found: mark - name <<h>> pad <<:>> value <<localhost>>
Found: mark - name <<d>> pad <<:>> value <<theDB>>

Line: --d theDB   -o:"C:\temp\db\" -host=local-host     -r
Found: mark -- name <<d>> pad << >> value <<theDB>>
Found: mark - name <<o>> pad <<:>> value <<"C:\temp\db\">>
Found: mark - name <<host>> pad <<=>> value <<local-host>>
Found: mark - name <<r>> pad <<>> value <<>>

Line: -s -i:C:\Users\Fozzie\Workspace\TheProject\TheProject-Stack-1_5\db\DB Scripts\ -h:local:host -d:theDB
Found: mark - name <<s>> pad <<>> value <<>>
Found: mark - name <<i>> pad <<:>> value <<C:\Users\Fozzie\Workspace\TheProject\TheProject-Stack-1_5\db\DB Scripts\>>
Found: mark - name <<h>> pad <<:>> value <<local:host>>
Found: mark - name <<d>> pad <<:>> value <<theDB>>

Line: /d theDB   /o:"C:\temp\db\" /host=local-host     /r
Found: mark / name <<d>> pad << >> value <<theDB  >>
Found: mark / name <<o>> pad <<:>> value <<"C:\temp\db\">>
Found: mark / name <<host>> pad <<=>> value <<local-host    >>
Found: mark / name <<r>> pad <<>> value <<>>

Line: /d theDB /o:"C:\temp\db\" /host=local-host /r /t
Found: mark / name <<d>> pad << >> value <<theDB>>
Found: mark / name <<o>> pad <<:>> value <<"C:\temp\db\">>
Found: mark / name <<host>> pad <<=>> value <<local-host>>
Found: mark / name <<r>> pad <<>> value <<>>
Found: mark / name <<t>> pad <<>> value <<>>

Line: -s:C:\Users\Fozzie\Workspace\TheProject\TheProject-Stack-1_5\db\DB Scripts\ -h:local:host -d:theDB
Found: mark - name <<s>> pad <<:>> value <<C:\Users\Fozzie\Workspace\TheProject\TheProject-Stack-1_5\db\DB Scripts\>>
Found: mark - name <<h>> pad <<:>> value <<local:host>>
Found: mark - name <<d>> pad <<:>> value <<theDB>>

4 Comments

I converted it to .NET form like you did above. This does seem to catch all of the above sample cases. Thank you!
I am vaguely familiar with Perl's /g but in what way would it be necessary? The .NET regex as you have above seems to match all my sample cases.
If the regex does what you need, then you're good to go. If the /g was omitted in the Perl code (where the regex is applied), then it would find the first option each time it was applied to the string. The /g ensures that it steps through the string — that's all. If the way you write the code in .NET to use the regex already handles that for you, or is just 'standard technique' then that's great. It's always a problem where there's a lack of knowledge about the details of the other person's area of expertise, yet you're trying to help each other understand enough. I was covering my bases!
I'm sorry, I was confusing /g with \G. .NET can do /g (global) matching. Thanks again.

Edit: changed the solution so now it can match FILEPATHS like D:\huheue\hello\my name.pdf

This is the regular expression I've got and it works pretty well.

(--?[a-zA-Z]+)[:\s=]?([A-Z]:(?:\\[\w\s-]+)+\\?(?=\s-)|\"[^\"]*\"|[^-][^\s]*)?

I hope it's what you needed. You will get output like this:

MATCH 1
1.  [0-2]   `-s`
MATCH 2
1.  [3-5]   `-i`
2.  [6-70]  `C:\Users\Fozzie\Workspace\TheProject\TheProject-Stack-1_5\db\DB Scripts`
MATCH 3
1.  [80-82] `-h`
2.  [83-93] `local:host`
ETC
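
A quick way to try this pattern on other inputs is a small Python harness. This is a sketch only: the pattern is copied unchanged from this answer, `RX` and `switches` are hypothetical names, and the two sample strings are variations on the question's `-d http://www.theproject.com` line that differ only in the number of spaces after -s.

```python
import re

# Pattern copied from this answer as-is; \" is simply a literal double quote.
RX = re.compile(
    r'(--?[a-zA-Z]+)[:\s=]?'
    r'([A-Z]:(?:\\[\w\s-]+)+\\?(?=\s-)|\"[^\"]*\"|[^-][^\s]*)?'
)

def switches(line):
    """Return (switch, value) pairs; a switch without a value gets an empty string."""
    return [(m.group(1), m.group(2) or '') for m in RX.finditer(line)]

for sample in (
    '-s -d http://www.theproject.com -h:localhost',   # one space after -s
    '-s  -d http://www.theproject.com -h:localhost',  # two spaces after -s
):
    print(switches(sample))
```

In this sketch the single-spaced line yields the URL as the value of -d. With two spaces after -s, the second space and the following -d are taken as the value of -s and the URL is skipped, which is consistent with the first comment below.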

2 Comments

It doesn't seem to catch the -d http://www.theproject.com or paths with spaces like `-i:C:\Users\Fozzie\Workspace\TheProject\TheProject-Stack-1_5\db\DB Scripts`. But it's closer than mine! :)
That's because you shouldn't use it like that, dude; you should quote the path properly. It does catch theproject.com, just not unquoted spaces.
