I have some text from which I'd like to extract a few key-value pairs. I want to do this as efficiently as possible, so I was thinking of a regex. But I don't understand how to say "if this key exists, take its value, and if not, carry on and take the other key-value pairs that do exist".

So let's say I have this text, and I want to extract only Value3 and Value4:

Placeholder1
String: Key1=Value1, Key2=Value2, Key3=Value3, Key4=Value4
Placeholder2
String: Key1=Value1, Key2=Value2, Key3=Value3, Key4=Value4

For this run I just want the first appearance, i.e. right after Placeholder1. So I have something like this:

Placeholder1\s*.*Key3=([a-zA-Z0-9 -]*).*Key4=([a-zA-Z0-9 -]*)

Which works and gets me Group 1 = Value3, Group 2 = Value4. Excellent. However, if I have the following string without Key3=Value3:

Placeholder1
String: Key1=Value1, Key2=Value2, Key5=Value5, Key4=Value4

My regex of course doesn't work, even though I want it to get me Key4. So I thought that wrapping each group in ()? would work, so that a pair is taken if it exists and skipped if not:

Placeholder1\s*.*(Key3=([a-zA-Z0-9 -]*))?.*(Key4=([a-zA-Z0-9 -]*))?

However, adding the ? captures nothing, even from the original text where both key-value pairs exist. When I remove the ?, it works again, but not when Key3 is missing.
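The behaviour can be reproduced with a small Python sketch. Python's re module is an assumption here, since the question doesn't name an engine, and the variable names are only illustrative. The pattern does match, but every group comes back as None: the greedy .* takes the whole line, and the optional groups are then allowed to match nothing.

```python
import re

text = (
    "Placeholder1\n"
    "String: Key1=Value1, Key2=Value2, Key3=Value3, Key4=Value4\n"
)

# The attempt with both key groups made optional.
optional_pattern = re.compile(
    r"Placeholder1\s*.*(Key3=([a-zA-Z0-9 -]*))?.*(Key4=([a-zA-Z0-9 -]*))?"
)

m = optional_pattern.search(text)
matched = m is not None   # the overall match succeeds
captured = m.groups()     # but none of the four groups took part in it
print(matched, captured)
```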

So how do I build a regex that will take the maximum number of key-value pairs that exist in the text?

PS - The key-value pairs can appear with/without other key-value pairs between them.

2 Answers

Your regex is almost fine. The problem is the greedy .* just before Key3: it consumes as much of the line as it can, and once the groups are optional nothing forces it to give any of that back, so Key3's value is never captured. Add ? after that .* to make it non-greedy, so that it stops at the first Key3 it finds.

Your regex : Placeholder1\s*.*Key3=([a-zA-Z0-9 -]*).*Key4=([a-zA-Z0-9 -]*)

Modified regex: Placeholder1\s*.*?Key3=([a-zA-Z0-9 -]*).*Key4=([a-zA-Z0-9 -]*)
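To try the modified regex, here is a minimal Python sketch using the standard re module. The choice of Python and the variable names are assumptions for illustration; the sample text is the one from the question. Note that both keys are still required by this pattern.

```python
import re

# Sample text from the question (two placeholder blocks).
text = (
    "Placeholder1\n"
    "String: Key1=Value1, Key2=Value2, Key3=Value3, Key4=Value4\n"
    "Placeholder2\n"
    "String: Key1=Value1, Key2=Value2, Key3=Value3, Key4=Value4\n"
)

# Lazy .*? before Key3, as in the modified regex.
pattern = re.compile(
    r"Placeholder1\s*.*?Key3=([a-zA-Z0-9 -]*).*Key4=([a-zA-Z0-9 -]*)"
)

match = pattern.search(text)
both = match.groups() if match else None
print(both)

# Both keys are mandatory here: without Key3 there is no match at all.
no_key3 = "Placeholder1\nString: Key1=Value1, Key2=Value2, Key5=Value5, Key4=Value4\n"
missing = pattern.search(no_key3)
print(missing)
```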

Edit: If both Key3 and Key4 can be optionally present

Then you can use this regex,

Placeholder1\s*(?:(?!(?:Key[34])).)*(?:Key3=([a-zA-Z0-9 -]*))?(?:(?!(?:Key[34])).)*(?:Key4=([a-zA-Z0-9 -]*))?

Here is the explanation:

The regex above may look complex, but the explanation is quite simple. I have just replaced each . from your original regex with (?:(?!(?:Key[34])).), an expression known as a tempered greedy token. It still matches any character, but it stops as soon as the text ahead starts with Key3 or Key4 ([34] means a single character, either 3 or 4), which is exactly what we want: consume everything except the Key3 and Key4 keys themselves, so that they are left for the optional groups to capture. Feel free to ask if you still have any doubts.

A plain . won't do here: a greedy .* consumes the whole line and a lazy .*? consumes nothing, and since the groups that follow are optional, the regex still succeeds either way without capturing anything. Hence you need a tempered greedy dot that stops short of Key3 or Key4.
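The three cases (both keys present, Key3 missing, Key4 missing) can be checked with a short Python sketch. The regex is the one from the edit above, unchanged; the extract helper and the sample strings are hypothetical names made up for this example.

```python
import re

# Regex from the edit above, split over two string literals for readability.
pattern = re.compile(
    r"Placeholder1\s*(?:(?!(?:Key[34])).)*(?:Key3=([a-zA-Z0-9 -]*))?"
    r"(?:(?!(?:Key[34])).)*(?:Key4=([a-zA-Z0-9 -]*))?"
)

def extract(text):
    """Return (Key3 value, Key4 value); a missing key yields None."""
    m = pattern.search(text)
    return m.groups() if m else (None, None)

both = "Placeholder1\nString: Key1=Value1, Key2=Value2, Key3=Value3, Key4=Value4\n"
no_key3 = "Placeholder1\nString: Key1=Value1, Key2=Value2, Key5=Value5, Key4=Value4\n"
no_key4 = "Placeholder1\nString: Key1=Value1, Key2=Value2, Key3=Value3, Key5=Value5\n"

print(extract(both))
print(extract(no_key3))
print(extract(no_key4))
```

This pattern assumes Key3 comes before Key4 on the line; if the order is swapped, only Key4 is captured.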

If this works for you, I will add explanation to my regex. (Now added above)

On a simpler note, I feel it would be better to just use the following two regexes to capture Key3 and Key4 separately, as each regex is much simpler to write and maintain:

Placeholder1[\w\W]*?Key3=([a-zA-Z0-9 -]*) (For finding Key3's value)
Placeholder1[\w\W]*?Key4=([a-zA-Z0-9 -]*) (For finding Key4's value)

One more benefit of this approach is that it is immune to the order in which Key3 and Key4 appear in your string.
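A rough Python sketch of the one-regex-per-key idea follows. The find_value helper and its parameters are hypothetical; it simply builds the two patterns above from a key name.

```python
import re

def find_value(text, key, placeholder="Placeholder1"):
    """Return the first value of `key` found after `placeholder`, or None."""
    pattern = (
        re.escape(placeholder) + r"[\w\W]*?" + re.escape(key) + r"=([a-zA-Z0-9 -]*)"
    )
    m = re.search(pattern, text)
    return m.group(1) if m else None

# Key4 appears before Key3 here; each key is still found.
swapped = "Placeholder1\nString: Key4=Value4, Key1=Value1, Key3=Value3\n"
key3 = find_value(swapped, "Key3")
key4 = find_value(swapped, "Key4")
print(key3, key4)

# [\w\W] also matches newlines, so a key missing from the first block
# is picked up from a later block instead.
two_blocks = (
    "Placeholder1\nString: Key1=Value1, Key4=Value4\n"
    "Placeholder2\nString: Key3=Other\n"
)
spill = find_value(two_blocks, "Key3")
print(spill)
```

One thing to keep in mind: because [\w\W] crosses line breaks, a key that is absent right after Placeholder1 may be taken from a later block. If values should only come from the line right after the placeholder, replacing [\w\W]*? with \s*.*? is one option in Python, since . does not match newlines by default.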

5 Comments

Hey, thanks. But if I change Key4 to Key5 in the text, it doesn't capture Key3.
I mean, if Key4 isn't there, I would still want Key3 to be captured, but right now it isn't.
@Cauthon: Ok, if both can be optionally present/absent then you need a little more sophisticated regex. Please check my updated answer.
Wow, that works perfectly, but I'm not sure I really understand it completely. I'd be happy to read an explanation. And you might certainly be right about simply splitting the regex up for each key...
@Cauthon: Glad it worked for you. Let me add the explanation to my updated regex.

I suppose you want to get only Key3 and Key4. If that's the case, you could use the OR operator in regex too; the syntax is (|). So change your regex to .*Key3=([a-zA-Z0-9 -]*)|.*Key4=([a-zA-Z0-9 -]*). It will try to match either Key3 or Key4, and if neither is found it moves on to the next line. Also remember to add the MULTILINE flag to your regex function call.
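As a sketch of how the alternation behaves, here it is in Python (an assumption, since no language is named) with re.finditer over a two-block sample. Exactly one of the two groups is set in each match, so the loop checks which branch fired.

```python
import re

text = (
    "Placeholder1\n"
    "String: Key1=Value1, Key2=Value2, Key5=Value5, Key4=Value4\n"
    "Placeholder2\n"
    "String: Key1=Value1, Key2=Value2, Key3=Value3, Key4=Value4\n"
)

# re.MULTILINE is passed as suggested; in Python it only changes how ^ and $
# behave, and this pattern uses neither, so it has no effect here.
pattern = re.compile(
    r".*Key3=([a-zA-Z0-9 -]*)|.*Key4=([a-zA-Z0-9 -]*)", re.MULTILINE
)

found = []
for m in pattern.finditer(text):
    if m.group(1) is not None:
        found.append(("Key3", m.group(1)))   # first branch matched
    else:
        found.append(("Key4", m.group(2)))   # second branch matched
print(found)
```

The pattern is not tied to any placeholder, so pairs from every line in the text are returned.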

2 Comments

Thanks! This works for the basic example, but I just updated it with a more general one. The | works but it also disregards my Placeholder1 (so it takes both lines of key-value pairs in the example). I just need the pairs from right after the placeholder. How do I let | know to only start from the current line and not from the whole text?
It's a bit tricky, but a handy solution is to first extract the whole line that follows the desired placeholder with a regex like Placeholder1\n.* and then extract the key values from that result using the expression I mentioned earlier, without having to worry about the placeholder.
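The two-step idea could look roughly like this in Python. The pairs_after helper is hypothetical, and as a small variation it uses a single (Key3|Key4)= alternation for the second step instead of the two .*-prefixed branches.

```python
import re

def pairs_after(text, placeholder, keys=("Key3", "Key4")):
    """Return {key: value} for the wanted keys on the line after `placeholder`."""
    # Step 1: grab the placeholder and the single line that follows it.
    block = re.search(re.escape(placeholder) + r"\n.*", text)
    if block is None:
        return {}
    # Step 2: look for each wanted key inside that one line only.
    key_alt = "|".join(re.escape(k) for k in keys)
    pair = re.compile(r"(" + key_alt + r")=([a-zA-Z0-9 -]*)")
    return {m.group(1): m.group(2) for m in pair.finditer(block.group(0))}

text = (
    "Placeholder1\n"
    "String: Key1=Value1, Key2=Value2, Key5=Value5, Key4=Value4\n"
    "Placeholder2\n"
    "String: Key1=Value1, Key2=Value2, Key3=Value3, Key4=Value4\n"
)

first = pairs_after(text, "Placeholder1")
second = pairs_after(text, "Placeholder2")
print(first)
print(second)
```

Because the second step only sees the extracted line, pairs from other placeholder blocks cannot leak in, and the order of the keys on the line does not matter.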
