Currently I have a string that I want to parse and pick up certain values.

The current regex findall pattern that I have is:

re.findall(r'(?P<key>\w+)\s+(?P<value>\w+)')

With this regex findall pattern I can pick up the key and values of the following:

--key1=value1 --key2=value2

But if the value is a string with spaces, it doesn't pick it up. Examples that don't work:

--key1=this is value 1 --key2=value2
--key1=only kvp
--key1=this/doesnt/work/

How can I adjust the regex pattern to pick up the string after the = sign?

  • what output you want? Commented Aug 14, 2022 at 20:03
  • Could You provide a minimal working example for what You have tried? re.findall(r'(?P<key>\w+)\s+(?P<key>\w+)') doesn't work for at least two reasons (one missing argument, redefinition of capture group key). Commented Aug 14, 2022 at 20:09
  • can you edit your code with the output you want for your sample input? Commented Aug 14, 2022 at 20:11
  • Are those command line arguments? They usually escape whitespaces or quote the string. Commented Aug 14, 2022 at 20:26
  • The output I am looking for is a list of the found key, value pairs. So using the example: ['key1', 'value1'], ['key2', 'value2'] Commented Aug 14, 2022 at 20:45

1 Answer

I started by changing your regex to --(?P<key>\w+)=(?P<value>\w+). This way, it uses "=" instead of a whitespace as a separator between key and value. It also requires "--" to precede the key, which seems to be a rule in your data.

Now let's tackle the main problem, which is to capture as the value everything after the "=" sign up to the next key.

This can be done in three steps:

  1. Change the regex for the value from \w+ to .+. You want to capture all characters, so you cannot limit yourself to \w; . matches any character except a newline. Of course, this change causes a new problem: the value now contains everything that follows the key, even if that is "value1 --key2=value2". The remaining two steps fix this.

  2. The next step is to make the regex non-greedy. Change the regex for the value from .+ to .+? and it will capture the fewest characters it can instead of the most. On its own this still doesn't solve the problem: nothing follows the value in the pattern, so the regex captures only one character of it. We are a step closer, though.

  3. The last step is to keep the regex capturing the value until it encounters the next key or the end of the string. Add (?=$|\s--) at the end. (?=...) is a positive lookahead: whatever is inside it must follow the current position, but it is not part of the match itself. $|\s-- is an alternation of either the end of the string or a whitespace character followed by two dashes.
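The three steps can be traced on one of the failing inputs from the question. This is only a sketch: the variable names (text, start, step1, step2, step3) are made up for illustration, and the results in the comments are what each intermediate pattern should return for this particular input.

```python
import re

# One of the inputs from the question that the original pattern could not handle.
text = "--key1=this is value 1 --key2=value2"

# Starting point: "=" as separator, but the value is still limited to \w+.
start = re.findall(r'--(?P<key>\w+)=(?P<value>\w+)', text)
print(start)   # [('key1', 'this'), ('key2', 'value2')]

# Step 1: .+ is greedy, so the first value swallows the rest of the string.
step1 = re.findall(r'--(?P<key>\w+)=(?P<value>.+)', text)
print(step1)   # [('key1', 'this is value 1 --key2=value2')]

# Step 2: .+? is non-greedy and, with nothing after it, stops after one character.
step2 = re.findall(r'--(?P<key>\w+)=(?P<value>.+?)', text)
print(step2)   # [('key1', 't'), ('key2', 'v')]

# Step 3: the lookahead forces .+? to grow until the next key or the end of the string.
step3 = re.findall(r'--(?P<key>\w+)=(?P<value>.+?)(?=$|\s--)', text)
print(step3)   # [('key1', 'this is value 1'), ('key2', 'value2')]
```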

The finished regex is:

re.findall(r'--(?P<key>\w+)=(?P<value>.+?)(?=$|\s--)', string)

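It should handle everything other than a value that contains whitespace followed by -- (that would be read as the start of the next key) or a newline (which . does not match by default). A -- with no whitespace in front of it is fine. For example: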

import re
string = "--key1=value 1 has--really .:weird:. characters --key2=value2"
result = re.findall(r'--(?P<key>\w+)=(?P<value>.+?)(?=$|\s--)', string)
print(result)

gives:

[('key1', 'value 1 has--really .:weird:. characters'), ('key2', 'value2')]
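The question's comments ask for a list of [key, value] pairs rather than tuples. One possible way to get that, sketched here with a hypothetical helper called parse_pairs (the name and the PAIR_RE constant are not from the question), is to use re.finditer and read the named groups. The loop runs it on the inputs the question listed as not working.

```python
import re

# Same pattern as in the answer, compiled once so it can be reused.
PAIR_RE = re.compile(r'--(?P<key>\w+)=(?P<value>.+?)(?=$|\s--)')

def parse_pairs(text):
    """Return a list of [key, value] lists found in text (hypothetical helper)."""
    return [[m.group('key'), m.group('value')] for m in PAIR_RE.finditer(text)]

# The inputs from the question that the original pattern missed.
samples = [
    "--key1=this is value 1 --key2=value2",
    "--key1=only kvp",
    "--key1=this/doesnt/work/",
]
for sample in samples:
    print(parse_pairs(sample))
```

With re.findall the group names are not used in the result; re.finditer returns match objects, so m.group('key') and m.group('value') (or m.groupdict()) can be read by name.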

1 Comment

Thank you for breaking it down like this, made me understand this a lot faster! This solves it
