
I have a pandas dataframe with a catch-all column called "Misc", which contains optional sequences of characters. For example:

    Misc
    1. xxx=something;yyyblah=somethingelse;xyx=blah
    2. xyz=meh;yzxx=random;xyx=meh

I am only interested in 4-5 of these key=value; entries. For each of those keys I would like to add a new column to my dataframe that holds the value where the key is present, and "." or NaN where it is not. So if I were interested in xxx=...; and xyx=...;, my code would produce the following:

    Misc                                                xxx          xyx
    1. xxx=something;yyyblah=somethingelse;xyx=blah |  something  |  blah
    2. xyz=meh;yzxx=random;xyx=meh                  |  .          |  meh

Every entry in Misc begins with one of a set of 20-30 strings and ends with ";". I have tried using regexes ...

    df['xxx'] = df.Misc.str.extract(r'*(xxx=)*;)$', expand=True)

but that does not seem to be working. I also thought about simply removing all instances I do not care about, and then splitting so I get consistency. Any ideas?

3 Answers


To expand all parameters, you can use .str.extractall():

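# Pull every key=value pair out of each row, then turn the pairs into columns
x = (
    df.Misc.str.extractall(r"([^=\s]+)=([^;]+);?")  # one row per pair: 0 = key, 1 = value
    .groupby(level=0)[[0, 1]]                        # regroup the pairs by original row
    .apply(lambda g: dict(zip(g[0], g[1])))          # one {key: value} dict per row
    .apply(pd.Series)                                # expand each dict into columns
    .fillna("N/A")
)

# Attach the new columns to the original frame
df_out = pd.concat([df, x], axis=1)
print(df_out)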

Prints:

                                              Misc        xxx        yyyblah   xyx  xyz    yzxx
0  1. xxx=something;yyyblah=somethingelse;xyx=blah  something  somethingelse  blah  N/A     N/A
1                   2. xyz=meh;yzxx=random;xyx=meh        N/A            N/A   meh  meh  random
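
If only a handful of keys matter, the same extractall() idea can be narrowed before reshaping. This is a minimal sketch, not the code from the answer above: the group names key and value, the wanted list, and the use of pivot() are my own choices for illustration. It assumes the frame has a unique index and that no key repeats within a row (pivot() raises on duplicates).

```python
import pandas as pd

df = pd.DataFrame({"Misc": [
    "1. xxx=something;yyyblah=somethingelse;xyx=blah",
    "2. xyz=meh;yzxx=random;xyx=meh",
]})

# One row per key=value pair, indexed by (original row label, match number)
pairs = df["Misc"].str.extractall(r"(?P<key>[^=\s;]+)=(?P<value>[^;]+)")

wanted = ["xxx", "xyx"]  # illustrative choice of keys to keep

wide = (
    pairs[pairs["key"].isin(wanted)]       # drop the keys we do not care about
    .droplevel("match")                    # back to the original row labels
    .pivot(columns="key", values="value")  # one column per remaining key
    .reindex(index=df.index, columns=wanted)  # rows/keys with no match become NaN
)

df_out = df.join(wide)
print(df_out)
```

Rows that contain none of the wanted keys come back as NaN because of the final reindex().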


Try named groups:

df.Misc.str.extract(r'(?P<xxx>(?<=^xxx\=)\w+)|(?P<xyx>(?<=xyx\=)\w+$)')

Or

Use (?<=X)Y, also known as a positive lookbehind assertion, where Y is matched only if X is immediately to its left. Chain this with str.extract.

df[['xxx','xyx']]=df.Misc.str.extract(r'((?<=^xxx\=)\w+)'),df.Misc.str.extract(r'((?<=xyx\=)\w+$)')

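The second solution should result in the following. Note that the alternation in the first pattern matches only once per row, so for row 0 it fills xxx and leaves xyx as NaN: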

      Misc                                            xxx   xyx
0  xxx=something;yyyblah=somethingelse;xyx=blah  something  blah
1                   xyz=meh;yzxx=random;xyx=meh        NaN   meh
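
To see the difference between the two patterns, here is a small sketch on the same two rows (without the "1. " prefix, as in the output above). The variable name both and the one-call-per-column layout with expand=False are my own illustration, not the answer's exact code.

```python
import pandas as pd

df = pd.DataFrame({"Misc": [
    "xxx=something;yyyblah=somethingelse;xyx=blah",
    "xyz=meh;yzxx=random;xyx=meh",
]})

# Alternation: str.extract keeps only the first match in each row,
# so at most one of the two named groups is filled per row.
both = df["Misc"].str.extract(r"(?P<xxx>(?<=^xxx=)\w+)|(?P<xyx>(?<=xyx=)\w+$)")
print(both)

# One extract call per key fills the two columns independently.
# expand=False returns a Series, which assigns cleanly to a single column.
df["xxx"] = df["Misc"].str.extract(r"(?<=^xxx=)(\w+)", expand=False)
df["xyx"] = df["Misc"].str.extract(r"(?<=xyx=)(\w+)$", expand=False)
print(df)
```

Both lookbehinds are anchored (^ for xxx, $ for xyx), so this only works when xxx is the first entry and xyx the last, as in the sample data.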

5 Comments

I like the way this was done as well, though I am receiving ""None of [Index(['xxx', 'xyx'], dtype='object')] are in the [columns]""
Which specific line is giving you an error?
['xxx','xyx']]=df.Misc.str.extract('((?<=^xxx\=)\w+)'),df.Misc.str.extract('((?<=xyx\=)\w+$)')
You mean df[['xxx','xyx']]?
Yeah sorry, my copy/paste skills were lacking on that comment.

Change the capture group to match after xxx= instead of xxx= itself. The (?:;|$) checks for either ; or end-of-line as terminators.

df['xxx'] = df.Misc.str.extract(r'xxx=(.*?)(?:;|$)', expand=True)
df['xyx'] = df.Misc.str.extract(r'xyx=(.*?)(?:;|$)', expand=True)

Or you can assign() these columns automatically in a comprehension:

keys = ['xxx', 'xyx']
df = df.assign(**{key: df.Misc.str.extract(rf'{key}=(.*?)(?:;|$)', expand=True) for key in keys})

Output:

#                                               Misc                     xxx   xyx
# 0  1. xxx=something;yyyblah=somethingelse;xyx=blah               something  blah
# 1                   2. xyz=meh;yzxx=random;xyx=meh                     NaN   meh
# 2                             3. xxx=foo;xxxxy=bar                     foo   NaN
# 3              4. xxx=meh,blah/other=super 3;zzz=1  meh,blah/other=super 3   NaN
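
For anyone who wants to reproduce this from scratch, here is a self-contained sketch of the same approach on the four sample rows. Two details are my own additions rather than part of the answer: a leading \b so that a key cannot match as the tail of a longer key (a hypothetical yxxx= would otherwise match xxx=), and expand=False so each call returns a Series.

```python
import pandas as pd

df = pd.DataFrame({"Misc": [
    "1. xxx=something;yyyblah=somethingelse;xyx=blah",
    "2. xyz=meh;yzxx=random;xyx=meh",
    "3. xxx=foo;xxxxy=bar",
    "4. xxx=meh,blah/other=super 3;zzz=1",
]})

keys = ["xxx", "xyx"]
for key in keys:
    # Capture lazily up to the next ";" or the end of the string.
    # \b (an addition, see above) stops "xxx" matching inside a longer key.
    df[key] = df["Misc"].str.extract(rf"\b{key}=(.*?)(?:;|$)", expand=False)

print(df)
```

Row 4 shows that a value containing a second "=" is kept whole, because only ";" or the end of the string terminates the capture.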

Timings

I couldn't get Andrej's answer to work on my end (reindexing error), but these are the other timings with 40K rows:

>>> df = pd.DataFrame({'Misc':['1. xxx=something;yyyblah=somethingelse;xyx=blah','2. xyz=meh;yzxx=random;xyx=meh','3. xxx=foo;xxxxy=bar','4. xxx=meh,blah/other=super 3;zzz=1']})
>>> df = pd.concat([df]*10000)

>>> %timeit tdy(df)
75.5 ms ± 5.12 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

>>> %timeit wwnde(df)
83.6 ms ± 1.63 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
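
A note on the benchmark frame: pd.concat([df]*10000) repeats the original index labels, and code that aligns on the index can fail on that. I have not confirmed that this is what caused the reindexing error mentioned above, so treat it as a guess. The sketch below (the names base, dup and uniq are mine) only shows the difference in the index and how ignore_index=True avoids it.

```python
import pandas as pd

base = pd.DataFrame({"Misc": [
    "1. xxx=something;yyyblah=somethingelse;xyx=blah",
    "2. xyz=meh;yzxx=random;xyx=meh",
]})

# Plain concat keeps each copy's labels: 0, 1, 0, 1, 0, 1
dup = pd.concat([base] * 3)
print(dup.index.is_unique)

# ignore_index=True relabels the rows 0..5
uniq = pd.concat([base] * 3, ignore_index=True)
print(uniq.index.is_unique)
```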

4 Comments

With your approach (very simple, and probably computationally efficient, since I aim to throw out the others regardless), I get "wrong number of items passed 4, placement implies 1". Looking at the data, there are a few spots that could trip this up, though I do not see issues with the regex. 1. Categories can be substrings, like xxx= xxxxy=. I do not think this is a problem. 2. Categories can have more than one "=" like xxx=meh,blah/other=super 3. All kinds of punctuation can occur after "=", except ";", which is the termination char.
@ShaneK Ah sorry, had a typo. Fixed now. I tested with those 2 extra cases and they work on my end. Updated the answer with those examples.
@ShaneK Also added timings with 40,000 rows.
All of you are awesome, thanks for taking time out of your life to share your knowledge. @tdy, this worked for me.
