
I have some text from which I would like to extract key=value pairs (see below). I've attempted to use a regex, but the formatting of the pairs is not consistent: many values are enclosed in quotes, and some are not.

This is the regex which nearly worked, but there are a couple of outliers.

(\w*)=([\w,\",:,\-,(,\.,\+,\)]*)

Message meets Alert condition date=2020-08-20 time=00:33:57 devname=FGT3HD3999906624 devid=FGT3HD3999906624 logid="0100032003" type="event" subtype="system" level="information" vd="root" eventtime=1597847637407862934 tz="+1000" logdesc="Admin logout successful" sn="159999794" user="admin" ui="https(10.198.199.105)" method="https" srcip=10.198.199.105 dstip=192.168.23.254 action="logout" status="success" duration=4843 reason="timeout" msg="Administrator admin timed out on https(10.198.199.105)" Administrator IT Administrator Ph:


2 Answers


You have a few ways to do this. First, since you said your key-value pairs are embedded in a larger email, you need to extract them. You can do that with this regex, which checks for a line starting with a word and an equals sign:

import re

text = " ... Full email text ... "
dataPoints = re.search(r"^\w*=.*$", text, re.MULTILINE).group(0)
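Note that re.search returns None when no line matches, so calling .group(0) directly would raise an AttributeError in that case. A minimal sketch of a guarded version follows; the helper name extract_data_line and the sample email are made up for illustration, and the pattern uses \w+ rather than \w* so that a line beginning with a bare "=" is not picked up:

```python
import re

def extract_data_line(text):
    """Return the first line that starts with key=..., or None if there is none."""
    match = re.search(r"^\w+=.*$", text, re.MULTILINE)
    return match.group(0) if match else None

# Hypothetical email body, invented for this example
sample_email = (
    "Message meets Alert condition\n"
    'date=2020-08-20 time=00:33:57 level="information"\n'
    "Administrator IT"
)

data_points = extract_data_line(sample_email)
print(data_points)
```

The caller can then check for None before building the dictionary.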

Then you need to create your dictionary.

Option 1: Simplest

Use the following regex find:

result = dict(re.findall(r'(\w*)=(\".*?\"|\S*)', dataPoints))
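With this pattern, quoted values keep their surrounding quotes. If that is not wanted, one possible follow-up is to strip them in a dict comprehension. The sketch below uses a shortened version of the log line from the question; the variable names are illustrative:

```python
import re

# Shortened sample taken from the log line in the question
data_points = 'date=2020-08-20 logdesc="Admin logout successful" srcip=10.198.199.105 tz="+1000"'

# Same pattern as above: a quoted string, or else a run of non-space characters
pairs = re.findall(r'(\w*)=(\".*?\"|\S*)', data_points)

# Remove the enclosing double quotes from quoted values
result = {key: value.strip('"') for key, value in pairs}
print(result)
```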


Option 2: Typical split

Follow the typical method for this problem: split the various key-value combinations into a list, and then split each combination into separate keys and values. However, since your key-value pairs are separated by spaces rather than semicolons, ampersands, or something similar, and some of your values have spaces in them, we can't simply split by spaces. That means we need to use a regex lookahead for this to work properly:

# maxsplit=1 keeps an "=" inside a value from producing extra fields
regexSplit = dict([i.split("=", 1) for i in re.split(r"\s(?=\w+=)", dataPoints)])
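To make the lookahead step easier to follow, here is a small sketch that shows the intermediate list. The sample string is shortened from the question, except for the url field, which is a hypothetical value added to show a value containing "=". The sketch passes 1 as the second argument to split so that only the first "=" separates key from value:

```python
import re

# user, duration, and msg come from the question; url is a made-up field
data_points = 'user="admin" duration=4843 url="a?b=c" msg="Administrator admin timed out"'

# Split only on whitespace that is immediately followed by "word="
chunks = re.split(r"\s(?=\w+=)", data_points)
print(chunks)

# Split each chunk on its first "=" only
split_result = dict(chunk.split("=", 1) for chunk in chunks)
print(split_result)
```

The spaces inside the msg value are not followed by a word and an equals sign, so they do not trigger a split.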

Option 3: No regex

If you want to avoid using regex altogether for whatever reason, you can use the following, which splits on equals signs and then recombines the keys and values into the proper arrangement for creating a dictionary:

allSplits = dataPoints.split("=")
splitList = [allSplits[0]] + [i for a in allSplits[1:-1] 
    for i in a.rsplit(" ", 1)] + [allSplits[-1]]

splitDict = dict(zip(splitList[::2], splitList[1::2]))

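The code above assumes the text contains at least one key=value pair, that pairs are separated by single spaces, and that no value contains an equals sign. With no "=" at all, it produces a bogus entry that maps the whole string to itself.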
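To see why the recombination works, it can help to look at the intermediate lists. The following is a small trace using a shortened version of the log line from the question; the variable names are illustrative:

```python
# Shortened sample based on the log line in the question
data_points = 'type="event" duration=4843 reason="timeout"'

all_splits = data_points.split("=")
# Each middle piece holds "<value> <next key>":
# ['type', '"event" duration', '4843 reason', '"timeout"']

split_list = [all_splits[0]]
for piece in all_splits[1:-1]:
    # Split on the last space only, so values containing spaces stay whole
    split_list.extend(piece.rsplit(" ", 1))
split_list.append(all_splits[-1])

# Even positions are keys, odd positions are values
split_dict = dict(zip(split_list[::2], split_list[1::2]))
print(split_dict)
```

As with the other options, quoted values keep their quotes here.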



1 Comment

Thanks, this works perfectly and also caters for the dictionary creation.

What about adding an OR (|) to your regex, e.g.

(\w*)=(\"[\w\s\+()\.]*\"|[\w\-\:\.]*)

This matches the string you gave.

Note:

  • \"[\w\s\+()\.]*\" matches all the values enclosed in ""
  • [\w\-\:\.]* matches the ones without
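A short sketch of this pattern used from Python, on a shortened version of the sample line. It also shows one limitation: a quoted value containing a character outside the listed class comes back empty. The note field is a hypothetical example, not part of the original log:

```python
import re

# The pattern from this answer, unchanged
pattern = re.compile(r'(\w*)=(\"[\w\s\+()\.]*\"|[\w\-\:\.]*)')

# Shortened sample taken from the log line in the question
sample = 'time=00:33:57 tz="+1000" ui="https(10.198.199.105)" srcip=10.198.199.105'
matches = dict(pattern.findall(sample))
print(matches)

# Hypothetical field: "/" is not in the quoted character class,
# so neither alternative can consume the value and it is captured as ''
outlier = pattern.findall('note="a/b"')
print(outlier)
```

For input that may contain other characters inside quotes, the more general ".*?" alternative avoids having to list every allowed character.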

5 Comments

Thanks, the addition of pipe symbol catered for the outliers :)
(\w*)=(\".*?\"|\S*) is much simpler: regex101.com/r/m4o3LO/1
\d is already included in \w, it doesn't make sense to put both in a character class.
@Toto You are right, of course \w matches all alphanumeric characters. I updated the answer.
@jdaz Yes, it looks also way cleaner.
