Pandas & python: split dataframe into many dataframes based on column value containing substring

Question

I have a dataframe with potentially millions of rows like the following:

df:
     name value
1     bob1   abc
2     bob3   def
3     jake2  ghi
4     jake   jkl 
5     sam1   mno
6     bob5   pqr

How can I split this into multiple dataframes based on the name column values containing some substring, such as 'bob', 'jake', and 'sam' in this example?

The new dataframes can be still kept in one data structure such as a dictionary if this changes anything.

Desired dataframes:

df1:
     name value
1     bob1   abc
2     bob3   def
3     bob5   pqr

df2:
     name value
1     jake2  ghi
2     jake   jkl 

df3:
     name value
1     sam1   mno

What rule do you propose for splitting? Name (minus any integers)? — jpp
– jpp, Commented Feb 14, 2018 at 19:22
Yes precisely, name minus and trailing integers. I do not know the names ahead of time however, but they will always be letters , possibly containing an integer at the end. — bwrabbit
– bwrabbit, Commented Feb 14, 2018 at 19:24

Espoir Murhabazi · Accepted Answer · 2018-02-14 19:40:21Z

1

here is another approach:

get all different values :

def matching_function(x):
    match = re.match(r"([a-z]+)([0-9]+)", x, re.I)
    if match:
        return match.group(1)

The function remove the mumber from string , thanks for this answer Get all possibles values of names :

set(df.name.apply(matching_function))

Loop to those values and split the df:

df_list= []
for x in set(df.name.apply(matching_function)):
    if x :
        df_list.append(df.loc[df.name.apply(lambda y : y.startswith( x ))])

df_list contains splited dataframes

answered Feb 14, 2018 at 19:40

Espoir Murhabazi

6,4415 gold badges49 silver badges78 bronze badges

Sign up to request clarification or add additional context in comments.

Comments

jpp · Accepted Answer · 2018-02-14 19:27:43Z

0

This works. Note my dictionary keys are names, because that seemed most logical.

# get set of names
names = set(df.name.str.replace('\d+', ''))

# make dictionary
dfs = {n: df[df.name.str.replace('\d+', '') == n] for n in names}

# {'jake':     name value
# 3  jake2   ghi
# 4   jake   jkl,
#  'bob':    name value
# 1  bob1   abc
# 2  bob3   def
# 6  bob5   pqr,
#  'sam':    name value
# 5  sam1   mno}

answered Feb 14, 2018 at 19:27

jpp

166k37 gold badges301 silver badges362 bronze badges

Comments

BENY · Accepted Answer · 2018-02-14 19:27:57Z

0

IIUC

l=[y for _,y in df.groupby(df.name.str.replace('\d+', ''))]
Out[207]: 
l
[   name value
 1  bob1   abc
 2  bob3   def
 6  bob5   pqr,     name value
 3  jake2   ghi
 4   jake   jkl,    name value
 5  sam1   mno]

answered Feb 14, 2018 at 19:27

BENY

324k22 gold badges176 silver badges250 bronze badges

Collectives™ on Stack Overflow

Pandas & python: split dataframe into many dataframes based on column value containing substring

3 Answers 3

Comments

Comments

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

Comments

Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related