Pandas: split a CSV into multiple CSVs (or DataFrames) by a column

Question

I'm stuck on a problem, and any help or tips would be appreciated.

The problem: I have a CSV file in which one column can take several different values, like this:

Fruit;Color;The_evil_column
Apple;Red;something1
Apple;Green;something1
Orange;Orange;something1
Orange;Green;something2
Apple;Red;something2
Apple;Red;something3

I've loaded the data into a DataFrame, and I need to split it into multiple DataFrames based on the value of the column "The_evil_column":

df1
Fruit;Color;The_evil_column
Apple;Red;something1
Apple;Green;something1
Orange;Orange;something1

df2
Fruit;Color;The_evil_column
Orange;Green;something2
Apple;Red;something2

df3
Fruit;Color;The_evil_column
Apple;Red;something3

After reading some posts I'm even more confused, so any pointers would be welcome.

MaxU - stand with Ukraine · Accepted Answer · 2017-12-28 12:19:07Z

12

You can generate a dictionary of DataFrames:

d = {g:x for g,x in df.groupby('The_evil_column')}

In [95]: d.keys()
Out[95]: dict_keys(['something1', 'something2', 'something3'])

In [96]: d['something1']
Out[96]:
    Fruit   Color The_evil_column
0   Apple     Red      something1
1   Apple   Green      something1
2  Orange  Orange      something1

or a list of DataFrames:

In [103]: l = [x for _,x in df.groupby('The_evil_column')]

In [104]: l[0]
Out[104]:
    Fruit   Color The_evil_column
0   Apple     Red      something1
1   Apple   Green      something1
2  Orange  Orange      something1

In [105]: l[1]
Out[105]:
    Fruit  Color The_evil_column
3  Orange  Green      something2
4   Apple    Red      something2

In [106]: l[2]
Out[106]:
   Fruit Color The_evil_column
5  Apple   Red      something3
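
For anyone who wants to reproduce the split end to end, here is a minimal self-contained sketch. It assumes the sample data from the question, read from an in-memory string via io.StringIO rather than a real file; the names csv_text and frames are made up for illustration, and the reset_index call is an optional extra that the answer itself does not use.

```python
import io

import pandas as pd

# Sample data from the question, held in memory instead of a file (assumption).
csv_text = """Fruit;Color;The_evil_column
Apple;Red;something1
Apple;Green;something1
Orange;Orange;something1
Orange;Green;something2
Apple;Red;something2
Apple;Red;something3
"""

df = pd.read_csv(io.StringIO(csv_text), sep=';')

# One DataFrame per distinct value of the column.
# reset_index(drop=True) gives each piece its own 0..n-1 index.
frames = {
    key: group.reset_index(drop=True)
    for key, group in df.groupby('The_evil_column')
}

for key, group in frames.items():
    print(key, len(group))
```

Without reset_index each piece keeps the row labels it had in the original frame, which is what the outputs above show (3 and 4 for something2, 5 for something3).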

UPDATE:

In [111]: g = pd.read_csv(filename, sep=';').groupby('The_evil_column')

In [112]: g.ngroups   # number of unique values in the `The_evil_column` column
Out[112]: 3

In [113]: g.apply(lambda x: x.to_csv(r'c:\temp\{}.csv'.format(x.name)))
Out[113]:
Empty DataFrame
Columns: []
Index: []

This will produce 3 files:

In [115]: glob.glob(r'c:\temp\something*.csv')
Out[115]:
['c:\\temp\\something1.csv',
 'c:\\temp\\something2.csv',
 'c:\\temp\\something3.csv']
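
In the session above, apply is called only for its side effect of writing files, and its return value is not used. A plain loop over the groups is one alternative. The sketch below makes a few assumptions that are not in the answer: it writes into a temporary directory so that it runs anywhere, it keeps ';' as the separator in the output files, and it passes index=False so the row index is not written as an extra column. The names out_dir and written are hypothetical.

```python
import io
import tempfile
from pathlib import Path

import pandas as pd

# Sample data from the question, held in memory instead of a file (assumption).
csv_text = """Fruit;Color;The_evil_column
Apple;Red;something1
Apple;Green;something1
Orange;Orange;something1
Orange;Green;something2
Apple;Red;something2
Apple;Red;something3
"""

df = pd.read_csv(io.StringIO(csv_text), sep=';')

# Temporary output directory, a stand-in for c:\temp in the answer.
out_dir = Path(tempfile.mkdtemp())

written = []
for value, group in df.groupby('The_evil_column'):
    path = out_dir / f'{value}.csv'
    # index=False leaves the DataFrame index out of the file.
    group.to_csv(path, sep=';', index=False)
    written.append(path)

print(sorted(p.name for p in written))
```

Each file then has the same three columns as the input, so it can be read back with the same read_csv call.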

edited Dec 28, 2017 at 12:19

answered Dec 28, 2017 at 11:55


8 Comments

ECA Over a year ago

Loved that dict generation (very nice to know!), but what I wanted was the list, which worked perfectly. Now I'll try to write the function to store the data with to_csv after counting how many evil things the evil column has. Thank you so much!

MaxU - stand with Ukraine Over a year ago

@EliasCortAguelo, glad I could help. What is your end goal? To split one CSV by the The_evil_column column?

ECA Over a year ago

Yes, that's the idea. I have a variable named counter with a value of 0 and a for loop like "for result in range(len(d)): counter += 1; print l[counter]". It returns the 3 dataframes but ends with the error "IndexError: list index out of range".

ECA Over a year ago

Wow, incredible, thank you! That was exactly what I needed. Thank you very much, MaxU, I learned a lot with your help!

rpanai Over a year ago

That's why the way to access the df is via l[i][1] ;)
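
A note on the IndexError reported in the comments: in the quoted loop, counter is incremented before it is used as an index, so the final pass asks for l[len(l)], one position past the end of the list. Iterating over the list directly avoids the counter altogether. A small sketch, assuming the list is built as in the answer (renamed frames here) and the sample data is constructed inline:

```python
import pandas as pd

# Sample data from the question, built inline (assumption).
df = pd.DataFrame({
    'Fruit': ['Apple', 'Apple', 'Orange', 'Orange', 'Apple', 'Apple'],
    'Color': ['Red', 'Green', 'Orange', 'Green', 'Red', 'Red'],
    'The_evil_column': ['something1', 'something1', 'something1',
                        'something2', 'something2', 'something3'],
})

frames = [group for _, group in df.groupby('The_evil_column')]

# Iterate over the frames themselves; no manual counter, so no off-by-one.
for frame in frames:
    print(frame)

# If the position is needed as well, enumerate supplies it, starting at 0.
sizes = {i: len(frame) for i, frame in enumerate(frames)}
```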


Bartłomiej · Answer · 2017-12-28 12:02:57Z

0

You can just filter the frame by the value of the column:

frame=pd.read_csv('file.csv',delimiter=';')
frame['The_evil_column']=='something1'

The comparison returns a boolean Series:

0     True
1     True
2     True
3    False
4    False
5    False
Name: The_evil_column, dtype: bool

You can therefore use it to select the matching rows:

frame1 = frame[frame['The_evil_column']=='something1']

Later you can drop the column:

frame1 = frame1.drop('The_evil_column', axis=1)
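
Put together on the question's sample data, the filtering steps might look like the sketch below. The frame is built inline rather than read from file.csv, and .copy() is added as a precaution so that frame1 is an independent object rather than a slice of frame; both are assumptions of this sketch, not part of the answer.

```python
import pandas as pd

# Sample data from the question, built inline (assumption).
frame = pd.DataFrame({
    'Fruit': ['Apple', 'Apple', 'Orange', 'Orange', 'Apple', 'Apple'],
    'Color': ['Red', 'Green', 'Orange', 'Green', 'Red', 'Red'],
    'The_evil_column': ['something1', 'something1', 'something1',
                        'something2', 'something2', 'something3'],
})

# Boolean Series: True where the row belongs to 'something1'.
mask = frame['The_evil_column'] == 'something1'

# Keep only the matching rows, as an independent copy.
frame1 = frame[mask].copy()

# The split column is now constant, so it can be dropped.
frame1 = frame1.drop('The_evil_column', axis=1)

print(frame1)
```

This handles one value at a time, so it has to be repeated (or looped) for something2 and something3.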

answered Dec 28, 2017 at 12:02


Rahul Chawla · Answer · 2017-12-28 12:07:50Z

0

A simpler but less efficient way is:

data = pd.read_csv('input.csv', sep=';')  # the sample file is semicolon-delimited

out = []

for evil_element in list(set(list(data['The_evil_column']))):
    out.append(data[data['The_evil_column']==evil_element])

out will then hold a list of DataFrames, one per distinct value in the column.
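
One practical difference from groupby: a set has no defined order, so the order of the frames in out can vary between runs, whereas Series.unique() returns the values in order of first appearance. Below is a sketch of the same filtering loop written with unique(), on an inline copy of the sample data (an assumption), and cross-checked against the groupby result. The names expected and same are hypothetical.

```python
import pandas as pd

# Sample data from the question, built inline (assumption).
data = pd.DataFrame({
    'Fruit': ['Apple', 'Apple', 'Orange', 'Orange', 'Apple', 'Apple'],
    'Color': ['Red', 'Green', 'Orange', 'Green', 'Red', 'Red'],
    'The_evil_column': ['something1', 'something1', 'something1',
                        'something2', 'something2', 'something3'],
})

# unique() keeps first-appearance order, so the result is reproducible.
out = [
    data[data['The_evil_column'] == value]
    for value in data['The_evil_column'].unique()
]

# Cross-check against groupby; its sorted keys happen to match the
# first-appearance order for this particular data.
expected = [group for _, group in data.groupby('The_evil_column')]
same = all(a.equals(b) for a, b in zip(out, expected))

print(len(out), same)
```

The loop scans the whole column once per distinct value, which is where the "less efficient" remark comes from; groupby does the partitioning in a single pass.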

answered Dec 28, 2017 at 12:07

