7

I'm stuck on a problem and would appreciate some help or tips.

The problem: I have a CSV file in which one column can take several different values, like:

Fruit;Color;The_evil_column
Apple;Red;something1
Apple;Green;something1
Orange;Orange;something1
Orange;Green;something2
Apple;Red;something2
Apple;Red;something3

I've loaded the data into a DataFrame, and I need to split it into multiple DataFrames, one per distinct value of the column "The_evil_column":

df1
Fruit;Color;The_evil_column
Apple;Red;something1
Apple;Green;something1
Orange;Orange;something1

df2
Fruit;Color;The_evil_column
Orange;Green;something2
Apple;Red;something2

df3
Fruit;Color;The_evil_column
Apple;Red;something3

After reading some posts I'm even more confused, so any tips on how to do this would be welcome.

3 Answers

12

You can generate a dictionary of DataFrames, keyed by the values of the column:

d = {g:x for g,x in df.groupby('The_evil_column')}

In [95]: d.keys()
Out[95]: dict_keys(['something1', 'something2', 'something3'])

In [96]: d['something1']
Out[96]:
    Fruit   Color The_evil_column
0   Apple     Red      something1
1   Apple   Green      something1
2  Orange  Orange      something1

or a list of DataFrames:

In [103]: l = [x for _,x in df.groupby('The_evil_column')]

In [104]: l[0]
Out[104]:
    Fruit   Color The_evil_column
0   Apple     Red      something1
1   Apple   Green      something1
2  Orange  Orange      something1

In [105]: l[1]
Out[105]:
    Fruit  Color The_evil_column
3  Orange  Green      something2
4   Apple    Red      something2

In [106]: l[2]
Out[106]:
   Fruit Color The_evil_column
5  Apple   Red      something3
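
For anyone who wants to reproduce the split without a file on disk, here is a minimal self-contained sketch. It assumes the sample data from the question, read from an in-memory string; the names csv_text, by_value and frames are illustrative only.

```python
import io

import pandas as pd

# Sample data from the question, held in memory instead of a file.
csv_text = """Fruit;Color;The_evil_column
Apple;Red;something1
Apple;Green;something1
Orange;Orange;something1
Orange;Green;something2
Apple;Red;something2
Apple;Red;something3
"""

df = pd.read_csv(io.StringIO(csv_text), sep=';')

# Dictionary: group value -> sub-DataFrame
by_value = {key: group for key, group in df.groupby('The_evil_column')}

# List: sub-DataFrames, ordered by the sorted group values
frames = [group for _, group in df.groupby('The_evil_column')]

for key, group in by_value.items():
    print(key, len(group))
```

The dictionary is handy when the pieces are looked up by value; the list is enough when they are only iterated over.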

UPDATE: to split the CSV into one file per value of The_evil_column:

In [111]: g = pd.read_csv(filename, sep=';').groupby('The_evil_column')

In [112]: g.ngroups   # number of unique values in the `The_evil_column` column
Out[112]: 3

In [113]: g.apply(lambda x: x.to_csv(r'c:\temp\{}.csv'.format(x.name)))
Out[113]:
Empty DataFrame
Columns: []
Index: []

This will produce 3 files:

In [115]: glob.glob(r'c:\temp\something*.csv')
Out[115]:
['c:\\temp\\something1.csv',
 'c:\\temp\\something2.csv',
 'c:\\temp\\something3.csv']
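
An explicit loop over the groups is an alternative to calling apply purely for its side effect. The sketch below is only an illustration: it writes to a temporary directory instead of c:\temp, inlines the sample data from the question, and the names out_dir and written are made up for this example.

```python
import io
import os
import tempfile

import pandas as pd

# Sample data from the question, held in memory instead of a file.
csv_text = """Fruit;Color;The_evil_column
Apple;Red;something1
Apple;Green;something1
Orange;Orange;something1
Orange;Green;something2
Apple;Red;something2
Apple;Red;something3
"""

df = pd.read_csv(io.StringIO(csv_text), sep=';')

# Hypothetical output location; replace with a real directory as needed.
out_dir = tempfile.mkdtemp()

written = []
for value, group in df.groupby('The_evil_column'):
    path = os.path.join(out_dir, '{}.csv'.format(value))
    # index=False leaves the original row numbers out of the file
    group.to_csv(path, sep=';', index=False)
    written.append(path)

print(sorted(os.path.basename(p) for p in written))
```

Note that this sketch keeps the semicolon separator and omits the index, whereas the to_csv call above uses the defaults (comma separator, index included); adjust to taste.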

8 Comments

Loved that dict generation (very nice to know!), but what I wanted was the list, which worked perfectly. Now I'll try to write the function to store the data with to_csv after counting how many evil things the evil column has. Thank you so much!!
@EliasCortAguelo, glad I could help. What is your end goal? To split one CSV by the The_evil_column column?
Yes, that's the idea. I have a variable named counter with a value of 0 and a for loop like """for result in range(len(d)): counter += 1 print l[counter]""". It returns the 3 dataframes but gives a final error: "IndexError: list index out of range".
Wow, incredible, thank you!! That was exactly what I needed. Thank you very much, MaxU, I learned a lot with your help!!
that's why the way to access the df is via l[i][1] ;)
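
Regarding the IndexError mentioned in the comments: a hand-maintained counter is not needed to visit every piece. As a sketch, assuming l is a list of DataFrames like the one built in the answer (a small stand-in list is constructed here so the snippet runs on its own), iterating over the list directly never indexes past the end.

```python
import pandas as pd

# Stand-in for the list `l` from the answer, built from a tiny inline frame.
df = pd.DataFrame({
    'Fruit': ['Apple', 'Orange', 'Apple'],
    'The_evil_column': ['something1', 'something2', 'something3'],
})
l = [x for _, x in df.groupby('The_evil_column')]

# Iterate over the DataFrames themselves; enumerate supplies a position if needed.
for i, part in enumerate(l):
    print(i, part['The_evil_column'].iloc[0], len(part))
```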
0

You can simply filter the frame by the value of the column. Comparing the column with a value gives a boolean mask:

frame=pd.read_csv('file.csv',delimiter=';')
frame['The_evil_column']=='something1'

The comparison returns:

0     True
1     True
2     True
3    False
4    False
5    False
Name: The_evil_column, dtype: bool

You can therefore use this boolean Series to select the matching rows:

frame1 = frame[frame['The_evil_column']=='something1']

Afterwards you can drop the column if it is no longer needed:

frame1 = frame1.drop('The_evil_column', axis=1)
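
Put together, the filtering steps above might look like the following sketch. It assumes the sample data from the question, inlined so the snippet runs without file.csv; mask is an illustrative name.

```python
import io

import pandas as pd

# Sample data from the question, held in memory instead of a file.
csv_text = """Fruit;Color;The_evil_column
Apple;Red;something1
Apple;Green;something1
Orange;Orange;something1
Orange;Green;something2
Apple;Red;something2
Apple;Red;something3
"""

frame = pd.read_csv(io.StringIO(csv_text), delimiter=';')

# Boolean mask: True where the row belongs to 'something1'
mask = frame['The_evil_column'] == 'something1'

# Select the matching rows, then drop the now-constant column
frame1 = frame.loc[mask].drop('The_evil_column', axis=1)

print(frame1)
```

This handles one value at a time, so it suits cases where only a few known values are of interest.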


0

A simpler but less efficient way is:

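import pandas as pd

# The sample file in the question is semicolon-delimited
data = pd.read_csv('input.csv', sep=';')

out = []

# unique() yields each distinct value once, in order of first appearance
for evil_element in data['The_evil_column'].unique():
    out.append(data[data['The_evil_column'] == evil_element])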

out will then hold a list of DataFrames, one per distinct value of the column.
