Efficient method to split dataframe multiple times in Python?

Question

I currently have a pandas DataFrame df with 168078 rows × 43 columns. A summary of df is shown below:

              doi                           gender       order       year       ...       count
9384155       10.1103/PRL.102.039801        male         1           2009       ...       1
...
3679211       10.1103/PRD.69.024009         male         2           2004       ...       501

df is currently sorted by count, whose values range from 1 to 501.

I would like to split df into 501 smaller DataFrames, one per count value. In other words, at the end of the process I would have 501 different sub-DataFrames, each containing only the rows with a single count value.

Since the number of resulting (desired) DataFrames is quite high, and since count is numeric, I was wondering whether:

a) it is possible to split the DataFrame that many times (and if so, how), and

b) it is possible to name each DataFrame by its count value without manually assigning a name 501 times; for example, the DataFrame with count == 1 would be df.1 without my having to assign it by hand.

ansev · Accepted Answer · 2019-10-19 12:35:42Z

The best practice is to create a dictionary of DataFrames. An example is shown below:

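import pandas as pd

# Small sample frame standing in for the real data
df = pd.DataFrame({'A': [4, 5, 6, 7, 7, 5, 4, 5, 6, 7],
                   'count': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                   'C': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']})
print(df)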

   A  count  C
0  4      1  a
1  5      2  b
2  6      3  c
3  7      4  d
4  7      5  e
5  5      6  f
6  4      7  g
7  5      8  h
8  6      9  i
9  7     10  j

Now we create the dictionary. Each key is a value of count. Note that Series.unique is used so that each distinct count value produces exactly one key: if two or more rows share the same count value, they all end up in the same DataFrame under that key.

# One sub-DataFrame per distinct count value, keyed by that value
dfs = {key: df[df['count'] == key] for key in df['count'].unique()}

Below I show the content of the entire dictionary created and how to access it:

for key in dfs:
    print(f'dfs[{key}]')
    print(dfs[key])
    print('-'*50)


dfs[1]
   A  count  C
0  4      1  a
--------------------------------------------------
dfs[2]
   A  count  C
1  5      2  b
--------------------------------------------------
dfs[3]
   A  count  C
2  6      3  c
--------------------------------------------------
dfs[4]
   A  count  C
3  7      4  d
--------------------------------------------------
dfs[5]
   A  count  C
4  7      5  e
--------------------------------------------------
dfs[6]
   A  count  C
5  5      6  f
--------------------------------------------------
dfs[7]
   A  count  C
6  4      7  g
--------------------------------------------------
dfs[8]
   A  count  C
7  5      8  h
--------------------------------------------------
dfs[9]
   A  count  C
8  6      9  i
--------------------------------------------------
dfs[10]
   A  count  C
9  7     10  j
--------------------------------------------------
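The sample data above has only unique count values, so every sub-DataFrame holds a single row. As a minimal sketch of the duplicate case mentioned earlier, the frame below (hypothetical data, not from the question) repeats some count values; the same comprehension then collects all matching rows under one key.

```python
import pandas as pd

# Hypothetical sample data: count values 1 and 2 each appear twice.
df = pd.DataFrame({'A': [4, 5, 6, 7, 7],
                   'count': [1, 2, 1, 3, 2],
                   'C': ['a', 'b', 'c', 'd', 'e']})

# Same comprehension as above: one boolean mask per distinct count value.
dfs = {key: df[df['count'] == key] for key in df['count'].unique()}

print(len(dfs))   # number of distinct count values
print(dfs[1])     # every row whose count equals 1
```

Here dfs has three keys, and dfs[1] contains the two rows at index 0 and 2. Note that this approach scans the whole column once per distinct value, which may matter for a large frame with hundreds of distinct counts.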

Dev Khadka · Accepted Answer · 2019-10-19 14:14:04Z

You can just use groupby to get the result, as shown below. Here, g.groups maps each group name (the count value) to the row labels in that group, and g.get_group returns the sub-DataFrame for a given group name.

import numpy as np
import pandas as pd

df=pd.DataFrame({'A':np.random.choice(["a","b","c", "d"], 10),
                 'count':np.random.choice(10,10)
                })

g = df.groupby("count")
for key in g.groups:
    print(g.get_group(key))
    print("\n---------------")

Result (the input is random, so your output will differ)

   A  count
3  c      0

---------------
   A  count
9  a      2

---------------
   A  count
0  c      3
2  b      3

---------------
   A  count
1  b      4
5  d      4
6  a      4
7  b      4

---------------
   A  count
8  c      5

---------------
   A  count
4  d      8

---------------
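The groupby approach above can also be combined with the dictionary idea from the other answer. The following is a sketch rather than part of either answer: it relies on the fact that iterating over a GroupBy object yields (group name, sub-DataFrame) pairs, and the seeded generator rng is an assumption added only to make the example reproducible.

```python
import numpy as np
import pandas as pd

# Seeded generator so the example is reproducible (an assumption for this sketch)
rng = np.random.default_rng(0)
df = pd.DataFrame({'A': rng.choice(['a', 'b', 'c', 'd'], 10),
                   'count': rng.integers(0, 10, 10)})

# Iterating over a GroupBy yields (group name, sub-DataFrame) pairs,
# so the whole split happens in a single grouping pass.
dfs = {key: group for key, group in df.groupby('count')}

# Each sub-DataFrame contains only rows with its own count value.
for key, group in dfs.items():
    assert (group['count'] == key).all()
```

Each sub-DataFrame is then available as dfs[key], which addresses part (b) of the question without creating 501 separate variable names.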
