Group by in pandas dataframe and unioning a numpy array column

Question

I have a CSV file where one of the columns looks like a numpy array. The first few lines look like the following

first,second,third
170.0,2,[19 234 376]
170.0,3,[19 23 23]
162.0,4,[1 2 3]
162.0,5,[1 3 4]

I load this CSV into a pandas DataFrame using the following code:

data = pd.read_csv('myfile.csv', converters = {'first': np.float64, 'second': np.int64, 'third': np.array})

Now I want to group by the 'first' column and take the union of the 'third' column within each group. After doing this, my dataframe should look like:

170.0, [19 23 234 376]
162.0, [1 2 3 4]

How do I achieve this? I have tried several approaches, such as the following, and none of them gives this result.

group_data = data.groupby('first')
group_data['third'].apply(lambda x: np.unique(np.concatenate(x)))

Are you sure your second column is a np array? When I run your code I get object as the dtype, which indicates to me that it is in fact a string. Can you post the output from data.info() and also data['second'].iloc[0]?
– EdChum, Commented Aug 20, 2015 at 9:19
Yes, you are right, it is loaded as an object. Here is what it looks like:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4 entries, 0 to 3
Data columns (total 2 columns):
first 4 non-null float64
second 4 non-null object
dtypes: float64(1), object(1)
How do I load them as numpy arrays?
– Ram, Commented Aug 20, 2015 at 9:24
Try this:
data = pd.read_csv('myfile.csv', header=None, names=['first','second'], converters = {'first': np.float64, 'second': np.array})
The problem with your code is that your file does not have any header names, so your converters will not find a match.
– EdChum, Commented Aug 20, 2015 at 9:35
Sorry! I have corrected the original question. My file did have column names as the first line.
– Ram, Commented Aug 20, 2015 at 9:45
Well, you have inconsistent separators: some lines have spaces, others don't, as well as commas. Is this correct?
– EdChum, Commented Aug 20, 2015 at 9:50

TimCera · Accepted Answer · 2015-08-21 23:24:53Z

With your current csv file the 'third' column comes in as a string, instead of a list.
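
A quick way to confirm this is to inspect the type of a single cell. The following is only a minimal sketch: it assumes the sample rows from the question and reads them from an in-memory io.StringIO buffer (a stand-in for the real file), with no converters passed.

```python
import io

import pandas as pd

# Sample rows copied from the question; io.StringIO stands in for the CSV file.
csv_text = """first,second,third
170.0,2,[19 234 376]
170.0,3,[19 23 23]
162.0,4,[1 2 3]
162.0,5,[1 3 4]
"""

data = pd.read_csv(io.StringIO(csv_text))

# Look at one cell of the 'third' column: it is plain text, not a list or array.
first_value = data['third'].iloc[0]
print(type(first_value))
print(repr(first_value))
```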

There might be nicer ways to convert to a list, but here goes...

from ast import literal_eval

import numpy as np
import pandas as pd

data = pd.read_csv('test_groupby.csv')

# Convert to a string representation of a list...
data['third'] = data['third'].str.replace(' ', ',')

# Convert string to list...
data['third'] = data['third'].apply(literal_eval)

group_data = data.groupby('first')

# Two things matter here:
# - use x.values instead of x, since x is a Series
# - wrap the result in list(...) to return an aggregated value
#   (np.array should work here, but...?)
ans = group_data.aggregate(
    {'third': lambda x: list(np.unique(np.concatenate(x.values)))})

print(ans)
                    third
first                    
162          [1, 2, 3, 4]
170    [19, 23, 234, 376]
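
If you would rather do the parsing while the file is read, a converter function is another option. This is only a sketch under a few assumptions: the values are always whitespace-separated integers inside square brackets, parse_int_list is a hypothetical helper name, and io.StringIO stands in for the real file.

```python
import io

import numpy as np
import pandas as pd

# Sample rows copied from the question; io.StringIO stands in for the CSV file.
csv_text = """first,second,third
170.0,2,[19 234 376]
170.0,3,[19 23 23]
162.0,4,[1 2 3]
162.0,5,[1 3 4]
"""


def parse_int_list(text):
    """Hypothetical helper: turn '[19 234 376]' into [19, 234, 376]."""
    return [int(token) for token in text.strip('[]').split()]


# The converter runs on the raw text of every cell in the 'third' column.
data = pd.read_csv(io.StringIO(csv_text), converters={'third': parse_int_list})

# For each group, join the lists and keep the sorted unique values.
ans = data.groupby('first')['third'].apply(
    lambda s: np.unique(np.concatenate(s.tolist())).tolist())

print(ans)
```

Splitting on whitespace also tolerates repeated spaces between the numbers, which the replace-then-literal_eval route would not.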

1 Comment

Ram Over a year ago

Thank you very much! That works well! Thanks for literal_eval, which is the key!
