How to group Pandas data frame by column with regex match

Question

I have the following data frame:

import pandas as pd
df = pd.DataFrame({'id':['a','b','c','d','e'],
                   'XX_111_S5_R12_001_Mobile_05':[-14,-90,-90,-96,-91],
                   'YY_222_S00_R12_001_1-999_13':[-103,0,-110,-114,-114],
                   'ZZ_111_S00_R12_001_1-999_13':[1,2.3,3,5,6],
})

df.set_index('id',inplace=True)
df

Which looks like this:

Out[6]:
    XX_111_S5_R12_001_Mobile_05  YY_222_S00_R12_001_1-999_13  ZZ_111_S00_R12_001_1-999_13
id
a                           -14                         -103                          1.0
b                           -90                            0                          2.3
c                           -90                         -110                          3.0
d                           -96                         -114                          5.0
e                           -91                         -114                          6.0

What I want to do is to group the column based on the following regex:

\w+_\w+_\w+_\d+_([\w\d-]+)_\d+

So that in the end the columns are grouped by Mobile and 1-999.

What is the way to do this? I tried the following, but it fails to group them:

import re
grouped = df.groupby(lambda x: re.search(r"\w+_\w+_\w+_\d+_([\w\d-]+)_\d+", x).group(), axis=1)
for name, group in grouped:
    print(name)
    print(group)

Which prints:

XX_111_S5_R12_001_Mobile_05
YY_222_S00_R12_001_1-999_13
ZZ_111_S00_R12_001_1-999_13

What I want is for name to print:

Mobile
1-999
1-999

And group prints the corresponding data frame.

Could you give some additional details about what you are trying to achieve? It looks like you are trying to output 3 groups in your groupby, when the original dataframe only has 3 columns anyway. Furthermore, by definition of a groupby, the group names/labels (which you've called name) are unique, so the desired output you described is just not possible; the closest thing would be to create a row of labels (i.e. Mobile and 1-999) and use those in your groups instead, but I'm not sure if this is relevant to what you're trying to do.
– Ken Wei, Commented Mar 27, 2017 at 6:30

root · Accepted Answer · 2017-03-27 02:08:30Z

You can use .str.extract on the columns in order to extract substrings for your groupby:

# Performing the groupby.
pat = r'\w+_\w+_\w+_\d+_([\w\d-]+)_\d+'
grouped = df.groupby(df.columns.str.extract(pat, expand=False), axis=1)

# Showing group information.
for name, group in grouped:
    print(name)
    print(group, '\n')

Which returns the expected groups:

1-999
    YY_222_S00_R12_001_1-999_13  ZZ_111_S00_R12_001_1-999_13
id                                                          
a                          -103                          1.0
b                             0                          2.3
c                          -110                          3.0
d                          -114                          5.0
e                          -114                          6.0 

Mobile
    XX_111_S5_R12_001_Mobile_05
id                             
a                           -14
b                           -90
c                           -90
d                           -96
e                           -91
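
The attempt in the question prints the full column names because Match.group() with no arguments returns the whole match, while group(1) returns the first capture group (which is also what .str.extract returns). A minimal sketch with the standard re module, reusing the column names from the question; the variable names whole and captured are only illustrative:

```python
import re

# Column names copied from the question's data frame.
columns = [
    'XX_111_S5_R12_001_Mobile_05',
    'YY_222_S00_R12_001_1-999_13',
    'ZZ_111_S00_R12_001_1-999_13',
]

pattern = re.compile(r'\w+_\w+_\w+_\d+_([\w\d-]+)_\d+')

# group() with no argument is the entire matched text.
whole = [pattern.search(col).group() for col in columns]

# group(1) is only the part inside the first pair of parentheses.
captured = [pattern.search(col).group(1) for col in columns]

for col, label in zip(columns, captured):
    print(col, '->', label)
```

So changing .group() to .group(1) in the lambda should be enough to get the Mobile and 1-999 labels from the original approach.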

DYZ · Accepted Answer · 2017-03-27 01:49:51Z

After grouping, set the index of the new dataframe to [re.findall(r'\w+_\w+_\w+_\d+_([\w\d-]+)_\d+', col)[0] for col in df.columns] (which is ['Mobile', '1-999', '1-999']).
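
As a rough sketch of that list comprehension, assuming the column names from the question, the labels can be computed with the standard re module alone. The columns_by_label dictionary is a hypothetical helper added here only to show which columns share a label:

```python
import re
from collections import defaultdict

# Column names copied from the question's data frame.
columns = [
    'XX_111_S5_R12_001_Mobile_05',
    'YY_222_S00_R12_001_1-999_13',
    'ZZ_111_S00_R12_001_1-999_13',
]

pattern = re.compile(r'\w+_\w+_\w+_\d+_([\w\d-]+)_\d+')

# With a single capture group, findall returns the captured strings.
labels = [pattern.findall(col)[0] for col in columns]

# Hypothetical helper: collect the original column names under each label.
columns_by_label = defaultdict(list)
for label, col in zip(labels, columns):
    columns_by_label[label].append(col)

print(labels)
print(dict(columns_by_label))
```

The labels list could then be assigned to the columns of a copy of the data frame if renaming, rather than grouping, is what is needed.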

Looks like I overlooked your question, based on the wrong description. The problem that you have is not related to grouping. It is related to indexing.
– DYZ, Commented over a year ago

akuiper · Accepted Answer · 2017-03-27 02:10:35Z

Your regex has an issue: \w matches word characters, which include the underscore, and that doesn't seem to be what you want. If you just want to match letters, digits and hyphens, a character class such as [A-Za-z0-9-] would be better:

df.groupby(df.columns.str.extract(r"([A-Za-z0-9-]+)_\d+$", expand=False), axis=1).sum()
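
In newer pandas releases, groupby(axis=1) is deprecated, so here is a sketch of the same idea that transposes the frame, groups the rows, and transposes back. It assumes the data from the question; the names labels and summed are only illustrative, and the check on \w is included to show the underscore point:

```python
import re

import pandas as pd

# \w also matches the underscore, so it cannot separate the fields by itself.
assert re.fullmatch(r'\w+', 'S00_R12') is not None

# Data frame from the question, with 'id' used as the index.
df = pd.DataFrame(
    {
        'XX_111_S5_R12_001_Mobile_05': [-14, -90, -90, -96, -91],
        'YY_222_S00_R12_001_1-999_13': [-103, 0, -110, -114, -114],
        'ZZ_111_S00_R12_001_1-999_13': [1, 2.3, 3, 5, 6],
    },
    index=pd.Index(['a', 'b', 'c', 'd', 'e'], name='id'),
)

# expand=False keeps the result as an Index with one label per column.
labels = df.columns.str.extract(r'([A-Za-z0-9-]+)_\d+$', expand=False)

# Transpose so columns become rows, group the rows by label, then transpose back.
summed = df.T.groupby(labels).sum().T
print(summed)
```

The result has one column per label (1-999 and Mobile), each holding the row-wise sum of the original columns that share that label.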
