2

I have a dataframe with potentially millions of rows like the following:

df:
     name value
1     bob1   abc
2     bob3   def
3     jake2  ghi
4     jake   jkl 
5     sam1   mno
6     bob5   pqr

How can I split this into multiple dataframes based on the name column values containing some substring, such as 'bob', 'jake', and 'sam' in this example?

The new dataframes can be still kept in one data structure such as a dictionary if this changes anything.

Desired dataframes:

df1:
     name value
1     bob1   abc
2     bob3   def
3     bob5   pqr

df2:
     name value
1     jake2  ghi
2     jake   jkl 

df3:
     name value
1     sam1   mno
2
  • What rule do you propose for splitting? Name (minus any integers)? Commented Feb 14, 2018 at 19:22
  • Yes precisely, name minus and trailing integers. I do not know the names ahead of time however, but they will always be letters , possibly containing an integer at the end. Commented Feb 14, 2018 at 19:24

3 Answers 3

1

here is another approach:

get all different values :

def matching_function(x):
    match = re.match(r"([a-z]+)([0-9]+)", x, re.I)
    if match:
        return match.group(1)

The function remove the mumber from string , thanks for this answer Get all possibles values of names :

set(df.name.apply(matching_function))

Loop to those values and split the df:

df_list= []
for x in set(df.name.apply(matching_function)):
    if x :
        df_list.append(df.loc[df.name.apply(lambda y : y.startswith( x ))])

df_list contains splited dataframes

Sign up to request clarification or add additional context in comments.

Comments

0

This works. Note my dictionary keys are names, because that seemed most logical.

# get set of names
names = set(df.name.str.replace('\d+', ''))

# make dictionary
dfs = {n: df[df.name.str.replace('\d+', '') == n] for n in names}

# {'jake':     name value
# 3  jake2   ghi
# 4   jake   jkl,
#  'bob':    name value
# 1  bob1   abc
# 2  bob3   def
# 6  bob5   pqr,
#  'sam':    name value
# 5  sam1   mno}

Comments

0

IIUC

l=[y for _,y in df.groupby(df.name.str.replace('\d+', ''))]
Out[207]: 
l
[   name value
 1  bob1   abc
 2  bob3   def
 6  bob5   pqr,     name value
 3  jake2   ghi
 4   jake   jkl,    name value
 5  sam1   mno]

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.