2

I have a DataFrame with 3 columns and 1,000+ rows,

df 
   day         product         order
2010-01-01    150ml Mask          9
2010-01-02    230ml Lotion       27
2010-01-03    600ml Shampoo      33

I would like to create one subset per product, as follows:

 df_mask                 df_lotion            df_shampoo  
   day        order        day       order     day         order
2010-01-01      9       2010-01-02    27      2010-01-03    33   
2010-01-09      8       2010-01-05    30      2010-01-04    25
2010-01-11     13       2010-01-06    29      2010-01-06    46

This is how I do it,

# Create a list of the distinct product names
productNames = df['product'].unique().tolist()

# Return the rows of df that belong to a single product
def subtable(df, productName):
    return df[df['product'] == productName]

# Build one subset per product by hand
df_mask = subtable(df, '150ml Mask')
df_lotion = subtable(df, '230ml Lotion')
df_shampoo = subtable(df, '600ml Shampoo')

Is there a way to get all the subsets at once with a for loop? The data frame has many different products.

3 Answers

4

You can use groupby for this, which does exactly what you need:

# show example data
print(df)

     day           product             order
0    2010-01-01    "150ml Mask"          9
1    2010-01-02    "230ml Lotion"       27
2    2010-01-03    "600ml Shampoo"      33
3    2010-01-04    "250ml Mask"         12
4    2010-01-05    "330ml Lotion"       24
5    2010-01-06    "400ml Shampoo"      13

# split product column and keep only product name
df["product"] = df["product"].str.split(expand=True)[1]

# groupby product
products = df.groupby("product")

# print product and corresponding product df
for product, product_df in products:
    print(product)
    print(product_df)

Lotion
          day product  order
1  2010-01-02  Lotion     27
4  2010-01-05  Lotion     24

Mask
          day product  order
0  2010-01-01    Mask      9
3  2010-01-04    Mask     12

Shampoo
          day  product  order
2  2010-01-03  Shampoo     33
5  2010-01-06  Shampoo     13

To access a single group, use get_group, which corresponds to your subtable function:

mask_df = products.get_group("Mask")
print(mask_df)

    day         product     order
0   2010-01-01  Mask        9
3   2010-01-04  Mask        12
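
To check that get_group selects the same rows as the boolean-mask subtable from the question, here is a minimal sketch. The sample rows and the names via_group, via_mask and same_rows are made up for illustration and are not the asker's real data:

```python
import pandas as pd

# Hypothetical sample data, already reduced to the bare product name
df = pd.DataFrame({
    "day": ["2010-01-01", "2010-01-02", "2010-01-03", "2010-01-04"],
    "product": ["Mask", "Lotion", "Shampoo", "Mask"],
    "order": [9, 27, 33, 12],
})

def subtable(df, product_name):
    # Boolean-mask filter, as in the question
    return df[df["product"] == product_name]

products = df.groupby("product")

via_group = products.get_group("Mask")
via_mask = subtable(df, "Mask")

# Both approaches pick the same row labels from the original frame
same_rows = via_group.index.equals(via_mask.index)
print(same_rows)
print(via_group["order"].tolist())
```

The difference is mostly one of cost: the mask scans the product column once per product, while groupby partitions the frame once and then hands out the groups.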

Finally, to get all sub data frames in one dictionary, loop over products and drop the product column itself:

df_dict = {product: product_df.drop("product", axis=1) 
          for product, product_df in products}
print(df_dict["Mask"])

    day         order
0   2010-01-01  9
3   2010-01-04  12
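
If the full labels such as 150ml Mask should stay distinct, as in the question's own subtable calls, the same dictionary idea also works without splitting the product column first. A small sketch, assuming made-up rows and a hypothetical variable name subsets:

```python
import pandas as pd

# Hypothetical sample data using the full product labels from the question
df = pd.DataFrame({
    "day": pd.to_datetime(["2010-01-01", "2010-01-02", "2010-01-03", "2010-01-09"]),
    "product": ["150ml Mask", "230ml Lotion", "600ml Shampoo", "150ml Mask"],
    "order": [9, 27, 33, 8],
})

# One entry per distinct label; keep only the day and order columns
subsets = {
    name: group[["day", "order"]].reset_index(drop=True)
    for name, group in df.groupby("product")
}

print(sorted(subsets))
print(subsets["150ml Mask"])
```

A dictionary keyed by label replaces the separate df_mask, df_lotion, ... variables, so nothing has to be typed per product.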

3 Comments

Thank you for your answer. I tried df["product"] = df["product"].str.split(expand=True)[1], but some product names are not that regular; some look like 0.7OZ Mask UK 6. Is there another way to fix the problem?
@peggy What variations of the product labels are possible? How to extract the product name depends entirely on your input data. For the example in your comment, df["product"].str.split(expand=True)[1] should successfully extract Mask from 0.7OZ Mask UK 6. Or do you need Mask including the UK 6?
Yes, I need Mask UK 6. But I decided to assign each product a particular number to make sorting easier. Other than that, the code runs well. Thank you very much!
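
For labels like the one in these comments, one possible approach is to split only once, so that everything after the leading size token is kept. This is a sketch on made-up labels; it assumes the size is always the first whitespace-separated token, which may not hold for the real data. pd.factorize is shown as one way to give each name a number, as mentioned above:

```python
import pandas as pd

# Hypothetical labels, including the irregular one from the comments
labels = pd.Series(["150ml Mask", "0.7OZ Mask UK 6", "230ml Lotion", "250ml Mask"])

# n=1 splits on the first run of whitespace only; element 1 is the rest of the label
names = labels.str.split(n=1).str[1]
print(names.tolist())
# ['Mask', 'Mask UK 6', 'Lotion', 'Mask']

# Assign an integer code to each distinct name, in order of first appearance
codes, uniques = pd.factorize(names)
print(codes.tolist())
# [0, 1, 2, 0]
```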
0

See if it helps:

dfs = {}
for label, group in df.groupby('product'):
    # the second word of the label (the product name) becomes the key
    dfs[label.split(' ')[1]] = group

for key in dfs:
    print(dfs[key])
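
One caveat with this pattern: the groups are formed on the full label while the dictionary key is only the short name, so two labels that share a name (say 150ml Mask and 250ml Mask) end up under the same key and the later group replaces the earlier one. A sketch of the effect on made-up rows, together with one way around it (grouping by the short name directly); the names overwritten, short_name and merged are illustrative:

```python
import pandas as pd

# Hypothetical sample data with two different Mask sizes
df = pd.DataFrame({
    "day": ["2010-01-01", "2010-01-02", "2010-01-04"],
    "product": ["150ml Mask", "230ml Lotion", "250ml Mask"],
    "order": [9, 27, 12],
})

# Group by full label, key by short name: the second Mask group replaces the first
overwritten = {}
for label, group in df.groupby("product"):
    overwritten[label.split(" ")[1]] = group

# Group by the short name itself: both Mask rows land in one frame
short_name = df["product"].str.split().str[1]
merged = {name: group for name, group in df.groupby(short_name)}

print(len(overwritten["Mask"]))  # 1
print(len(merged["Mask"]))       # 2
```

If every short name occurs under exactly one label, both versions give the same result.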

0

I think you can store all the DataFrames in a dict, built with a dict comprehension over groupby and split:

producs = df['product'].str.split().str[-1]
print (producs)
0       Mask
1     Lotion
2    Shampoo
Name: product, dtype: object

dfs = {i: g.reset_index(drop=True) for i, g in df.groupby(producs)}
print (dfs)
{'Shampoo':           day        product  order
0  2010-01-03  600ml Shampoo     33, 'Mask':           day     product  order
0  2010-01-01  150ml Mask      9, 'Lotion':           day       product  order
0  2010-01-02  230ml Lotion     27}

print (dfs['Shampoo'])
          day        product  order
0  2010-01-03  600ml Shampoo     33

If you need to remove the product column, select the subset [['day','order']] or use drop:

dfs = {i: g.reset_index(drop=True)[['day','order']] for i, g in df.groupby(producs)}
#dfs = {i: g.reset_index(drop=True).drop('product', axis=1) for i, g in df.groupby(producs)}
print (dfs)
{'Shampoo':           day  order
0  2010-01-03     33, 'Mask':           day  order
0  2010-01-01      9, 'Lotion':           day  order
0  2010-01-02     27}

print (dfs['Shampoo'])
          day  order
0  2010-01-03     33
