Create multiple new dataframe columns from substring/regex matches in single column

Question

I have a pandas dataframe with a catch-all column called "Misc", which contains optional sequences of characters. For example:

    Misc
    1. xxx=something;yyyblah=somethingelse;xyx=blah
    2. xyz=meh;yzxx=random;xyx=meh

I am really only interested in 4-5 values/cases of something=something; and I would like to create new columns and add them to my dataframe for those instances, and "." or NaN if they do not exist. So if I was interested in xxx= ... ; and xyx=...; my code would do the following:

    Misc                                                xxx          xyx
    1. xxx=something;yyyblah=somethingelse;xyx=blah |  something  |  blah
    2. xyz=meh;yzxx=random;xyx=meh                  |  .          |  meh

All of the information in Misc will begin with a set of 20-30 strings, and end with ";". I have tried using regexes ...

    df['xxx'] = df.Misc.str.extract(r'*(xxx=)*;)$', expand=True)

but that does not seem to be working. I also thought about simply removing all instances I do not care about, and then splitting so I get consistency. Any ideas?

Andrej Kesely · Accepted Answer · 2021-04-19 00:30:06Z

2

To expand all parameters, you can use .str.extractall():

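import pandas as pd

# Extract every key=value pair, build one dict per row, then expand the dicts into columns.
x = (
    df.Misc.str.extractall(r"([^=\s]+)=([^;]+);?")
    .groupby(level=0)[[0, 1]]
    .apply(lambda g: dict(zip(g[0], g[1])))
    .apply(pd.Series)
    .fillna("N/A")
)

# Attach the new columns to the original frame.
df_out = pd.concat([df, x], axis=1)
print(df_out)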

Prints:

                                              Misc        xxx        yyyblah   xyx  xyz    yzxx
0  1. xxx=something;yyyblah=somethingelse;xyx=blah  something  somethingelse  blah  N/A     N/A
1                   2. xyz=meh;yzxx=random;xyx=meh        N/A            N/A   meh  meh  random
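If only a handful of keys matter, the same extractall() idea can be narrowed before reshaping. The following is a minimal sketch, not part of the original answer: the list `wanted` and the group names `key` and `value` are illustrative, and it assumes the frame has a unique index and that each key appears at most once per row (unstack() raises on duplicates).

```python
import pandas as pd

df = pd.DataFrame({"Misc": [
    "1. xxx=something;yyyblah=somethingelse;xyx=blah",
    "2. xyz=meh;yzxx=random;xyx=meh",
]})

wanted = ["xxx", "xyx"]  # hypothetical keys of interest

# One row per key=value pair, indexed by (original row, match number).
pairs = df["Misc"].str.extractall(r"(?P<key>[^=\s;]+)=(?P<value>[^;]*)")

# Keep only the keys we care about before reshaping.
pairs = pairs[pairs["key"].isin(wanted)]

# Pivot keys into columns; reindex so missing rows and keys show up as NaN.
wide = (
    pairs.reset_index(level="match", drop=True)
    .set_index("key", append=True)["value"]
    .unstack("key")
    .reindex(index=df.index, columns=wanted)
)

df_out = df.join(wide)
print(df_out)
```

Missing keys are left as NaN here; a trailing `.fillna(".")` on `wide` would give the "." placeholder mentioned in the question.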

answered Apr 19, 2021 at 0:30


wwnde · Accepted Answer · 2021-04-19 01:27:12Z

2

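Please try named groups. Because str.extract returns only the first match in each row, make each group optional inside a single pattern rather than joining them with |; otherwise a row containing both keys fills only one column.

df.Misc.str.extract(r'^(?:xxx=(?P<xxx>\w+))?.*?(?:xyx=(?P<xyx>\w+))?$')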

Or

Use (?<=X)Y, a positive lookbehind assertion, in which Y is matched only if X is immediately to its left. Chain this with str.extract.

df[['xxx','xyx']]=df.Misc.str.extract(r'((?<=^xxx\=)\w+)'),df.Misc.str.extract(r'((?<=xyx\=)\w+$)')

Either solution should result in

                                           Misc        xxx   xyx
0  xxx=something;yyyblah=somethingelse;xyx=blah  something  blah
1                   xyz=meh;yzxx=random;xyx=meh        NaN   meh
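Assigning a tuple of frames to two new columns at once may depend on the pandas version; a comment on this answer reports a KeyError on that line. As a hedged alternative, here is a sketch that assigns each column separately with the same lookbehind patterns. It uses expand=False so each call returns a Series, and it assumes sample data without the leading "1. " numbering, as in the output above.

```python
import pandas as pd

df = pd.DataFrame({"Misc": [
    "xxx=something;yyyblah=somethingelse;xyx=blah",
    "xyz=meh;yzxx=random;xyx=meh",
]})

# xxx is only accepted at the very start of the string (lookbehind on ^xxx=).
df["xxx"] = df["Misc"].str.extract(r"(?<=^xxx=)(\w+)", expand=False)

# xyx is only accepted as the last field (the value must run to the end).
df["xyx"] = df["Misc"].str.extract(r"(?<=xyx=)(\w+)$", expand=False)

print(df)
```

Because of the ^ and $ anchors, this only finds xxx in first position and xyx in last position, which matches the sample rows but may be too strict for other data.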

edited Apr 19, 2021 at 1:27

answered Apr 19, 2021 at 0:54


5 Comments

ShaneK Over a year ago

I like the way this was done as well, though I am receiving "None of [Index(['xxx', 'xyx'], dtype='object')] are in the [columns]"

wwnde Over a year ago

Which specific line is giving you an error?

ShaneK Over a year ago

['xxx','xyx']]=df.Misc.str.extract('((?<=^xxx\=)\w+)'),df.Misc.str.extract('((?<=xyx\=)\w+$)')

wwnde Over a year ago

You mean df[['xxx','xyx']]?

ShaneK Over a year ago

Yeah sorry, my copy/paste skills were lacking on that comment.

tdy · Accepted Answer · 2021-04-19 03:31:58Z

1

Change the capture group to capture what follows xxx= rather than xxx= itself. The non-capturing group (?:;|$) accepts either ; or the end of the string as the terminator.

df['xxx'] = df.Misc.str.extract(r'xxx=(.*?)(?:;|$)', expand=True)
df['xyx'] = df.Misc.str.extract(r'xyx=(.*?)(?:;|$)', expand=True)

Or you can assign() these columns automatically in a comprehension:

keys = ['xxx', 'xyx']
df = df.assign(**{key: df.Misc.str.extract(rf'{key}=(.*?)(?:;|$)', expand=True) for key in keys})

Output:

#                                               Misc                     xxx   xyx
# 0  1. xxx=something;yyyblah=somethingelse;xyx=blah               something  blah
# 1                   2. xyz=meh;yzxx=random;xyx=meh                     NaN   meh
# 2                             3. xxx=foo;xxxxy=bar                     foo   NaN
# 3              4. xxx=meh,blah/other=super 3;zzz=1  meh,blah/other=super 3   NaN
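One caveat with an unanchored `key=` pattern is that a key which is the tail of a longer key (for example a hypothetical `zxx` against `yzxx=`) would also match. The following sketch, under the assumption that every field is preceded by the start of the string, a ";" or whitespace, anchors the key and escapes it with re.escape. The helper name `extract_key` and the key `zxx` are illustrative only.

```python
import re
import pandas as pd

df = pd.DataFrame({"Misc": [
    "1. xxx=something;yyyblah=somethingelse;xyx=blah",
    "2. xyz=meh;yzxx=random;xyx=meh",
]})

# "zxx" is a hypothetical key chosen because it is a suffix of "yzxx".
keys = ["xxx", "xyx", "zxx"]

def extract_key(series, key):
    """Return the value for `key`, requiring the key to start a field."""
    # (?:^|[;\s]) demands start of string, ';' or whitespace before the key.
    pattern = rf"(?:^|[;\s]){re.escape(key)}=([^;]*)"
    return series.str.extract(pattern, expand=False)

df = df.assign(**{key: extract_key(df["Misc"], key) for key in keys})
print(df)
```

With the anchor in place, `zxx` stays NaN for the second row instead of picking up the value of `yzxx`.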

Timings

I couldn't get Andrej's answer to work on my end (reindexing error), but these are the other timings with 40K rows:

>>> df = pd.DataFrame({'Misc':['1. xxx=something;yyyblah=somethingelse;xyx=blah','2. xyz=meh;yzxx=random;xyx=meh','3. xxx=foo;xxxxy=bar','4. xxx=meh,blah/other=super 3;zzz=1']})
>>> df = pd.concat([df]*10000)

>>> %timeit tdy(df)
75.5 ms ± 5.12 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

>>> %timeit wwnde(df)
83.6 ms ± 1.63 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
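The wrappers `tdy` and `wwnde` timed above are not shown in the answer. As a rough sketch of how such a benchmark could be reproduced outside IPython, the following uses the standard timeit module with a hypothetical wrapper `extract_keys`. It also passes ignore_index=True to pd.concat so the enlarged frame has a unique index; repeating the labels 0..3 is one plausible cause of the reindexing error mentioned above, though that is an assumption. The frame is smaller than the 40K rows in the answer to keep the run short.

```python
import timeit
import pandas as pd

base = pd.DataFrame({"Misc": [
    "1. xxx=something;yyyblah=somethingelse;xyx=blah",
    "2. xyz=meh;yzxx=random;xyx=meh",
    "3. xxx=foo;xxxxy=bar",
    "4. xxx=meh,blah/other=super 3;zzz=1",
]})

# ignore_index=True renumbers rows 0..N-1 instead of repeating 0..3.
big = pd.concat([base] * 1000, ignore_index=True)

def extract_keys(frame, keys=("xxx", "xyx")):
    """Hypothetical wrapper around the per-key str.extract approach."""
    return frame.assign(**{
        key: frame["Misc"].str.extract(rf"{key}=(.*?)(?:;|$)", expand=False)
        for key in keys
    })

result = extract_keys(big)

# Average wall-clock time over a few calls; numbers vary by machine.
seconds = timeit.timeit(lambda: extract_keys(big), number=5) / 5
print(f"{seconds * 1000:.1f} ms per call")
```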

edited Apr 19, 2021 at 3:31

answered Apr 19, 2021 at 0:19


4 Comments

ShaneK Over a year ago

With your approach (very simple, and probably computationally efficient, since I aim to throw out the others regardless), I get "wrong number of items passed 4, placement implies 1". Looking at the data there are a few spots that could trip this up, though I do not see issues with the regex. 1. Categories can be substrings, like xxx= xxxxy=. I do not think this is a problem 2. Categories can have more than one "=" like xxx=meh,blah/other=super 3. All kinds of punctuation can occur after "=", except ";", which is the termination char

tdy Over a year ago

@ShaneK Ah sorry, had a typo. Fixed now. I tested with those 2 extra cases and they work on my end. Updated the answer with those examples.

tdy Over a year ago

@ShaneK Also added timings with 40,000 rows.

ShaneK Over a year ago

All of you are awesome, thanks for taking time out of your life to share your knowledge. @tdy, this worked for me.
