I have a pandas DataFrame df with 168078 rows × 43 columns. A summary of df is shown below:

              doi                           gender       order       year       ...       count
9384155       10.1103/PRL.102.039801        male         1           2009       ...       1
...
3679211       10.1103/PRD.69.024009         male         2           2004       ...       501

The df is currently sorted by count, whose values range from 1 to 501.

I would like to split df into 501 smaller DataFrames by count. In other words, at the end of the process I would have 501 sub-DataFrames, each containing the rows for a single count value.

Since the number of desired DataFrames is quite high, and since count is numeric, I was wondering if:

a) it is possible to split the DataFrame that many times (if yes, then how), and

b) it is possible to name each DataFrame by its count value without manually assigning a name 501 times; for example, the DataFrame with count == 1 would be df.1 without my having to assign it.

2 Answers

A good approach is to create a dictionary of DataFrames. Here is an example:

import pandas as pd

df=pd.DataFrame({'A':[4,5,6,7,7,5,4,5,6,7],
                 'count':[1,2,3,4,5,6,7,8,9,10],
                 'C':['a','b','c','d','e','f','g','h','i','j']})
print(df)

   A  count  C
0  4      1  a
1  5      2  b
2  6      3  c
3  7      4  d
4  7      5  e
5  5      6  f
6  4      7  g
7  5      8  h
8  6      9  i
9  7     10  j

Now we create the dictionary. Each key is a value of count. Note that Series.unique is used so that rows sharing the same count value end up together in one DataFrame under a single key.

dfs={key:df[df['count']==key] for key in df['count'].unique()}

Below I show the content of the entire dictionary created and how to access it:

for key in dfs:
    print(f'dfs[{key}]')
    print(dfs[key])
    print('-'*50)


dfs[1]
   A  count  C
0  4      1  a
--------------------------------------------------
dfs[2]
   A  count  C
1  5      2  b
--------------------------------------------------
dfs[3]
   A  count  C
2  6      3  c
--------------------------------------------------
dfs[4]
   A  count  C
3  7      4  d
--------------------------------------------------
dfs[5]
   A  count  C
4  7      5  e
--------------------------------------------------
dfs[6]
   A  count  C
5  5      6  f
--------------------------------------------------
dfs[7]
   A  count  C
6  4      7  g
--------------------------------------------------
dfs[8]
   A  count  C
7  5      8  h
--------------------------------------------------
dfs[9]
   A  count  C
8  6      9  i
--------------------------------------------------
dfs[10]
   A  count  C
9  7     10  j
--------------------------------------------------
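On a larger frame such as the 168078-row one in the question, the comprehension above filters the whole column once per key. A minimal sketch of an alternative, assuming the same column name count, builds an equivalent dictionary in a single groupby pass. The toy values below are made up for illustration and include repeated count values:

```python
import pandas as pd

# Toy data (hypothetical values) with repeated count values
df = pd.DataFrame({'A': [4, 5, 6, 7, 7, 5],
                   'count': [1, 2, 2, 3, 3, 3],
                   'C': ['a', 'b', 'c', 'd', 'e', 'f']})

# Iterating over a GroupBy yields (key, sub-DataFrame) pairs
dfs = {key: group for key, group in df.groupby('count')}

# The boolean-mask version, kept here only for comparison
dfs_mask = {key: df[df['count'] == key] for key in df['count'].unique()}

print(sorted(dfs))   # the distinct count values used as keys
print(dfs[3])        # all rows where count == 3
```

Either way, dfs[1], dfs[2], and so on play the role of the df.1-style names asked about in the question.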

You can use groupby to get the same result, as shown below. Here, g.groups maps each group name (the count value) to the row labels in that group, and g.get_group returns the sub-DataFrame for a given group name.

import numpy as np
import pandas as pd

df=pd.DataFrame({'A':np.random.choice(["a","b","c", "d"], 10),
                 'count':np.random.choice(10,10)
                })

g = df.groupby("count")
for key in g.groups:
    print(g.get_group(key))
    print("\n---------------")

Result

   A  count
3  c      0

---------------
   A  count
9  a      2

---------------
   A  count
0  c      3
2  b      3

---------------
   A  count
1  b      4
5  d      4
6  a      4
7  b      4

---------------
   A  count
8  c      5

---------------
   A  count
4  d      8

---------------
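Because the example above uses random data, its output changes from run to run. The sketch below uses a small fixed frame (hypothetical values) to show how the split can be checked: g.ngroups gives the number of sub-DataFrames, g.size() the number of rows in each, and g.get_group pulls out one of them by its count value.

```python
import pandas as pd

# Fixed toy data (hypothetical values) so the result is reproducible
df = pd.DataFrame({'A': ['c', 'b', 'b', 'c', 'd', 'd'],
                   'count': [3, 4, 3, 0, 8, 4]})

g = df.groupby('count')

n_groups = g.ngroups   # number of distinct count values
sizes = g.size()       # Series indexed by count: rows per group

# Every row belongs to exactly one group, so the sizes add up to len(df)
total_rows = int(sizes.sum())

# Fetch a single sub-DataFrame by its group name
sub_df = g.get_group(4)

print(n_groups, total_rows)
print(sub_df)
```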
