I have a CSV file where one of the columns looks like a numpy array. The first few lines look like this:

first,second,third
170.0,2,[19 234 376]
170.0,3,[19 23 23]
162.0,4,[1 2 3]
162.0,5,[1 3 4]

I load this CSV into a pandas DataFrame using the following code:

data = pd.read_csv('myfile.csv', converters = {'first': np.float64, 'second': np.int64, 'third': np.array})

Now I want to group by the 'first' column and take the union of the 'third' column within each group. After doing this, my dataframe should look like

170.0, [19 23 234 376]
162.0, [1 2 3 4]

How do I achieve this? I tried multiple ways like the following and nothing seems to help achieve this goal.

group_data = data.groupby('first')
group_data['third'].apply(lambda x: np.unique(np.concatenate(x)))
  • Are you sure your second column is a np array? When I run your code I get object as the dtype, which indicates to me it's in fact a string. Can you post the output from data.info() and also data['second'].iloc[0]? Commented Aug 20, 2015 at 9:19
  • Yes, you are right, it is loaded as an object. Here is how it looks: <class 'pandas.core.frame.DataFrame'> Int64Index: 4 entries, 0 to 3 Data columns (total 2 columns): first 4 non-null float64 second 4 non-null object dtypes: float64(1), object(1). How do I load them as numpy arrays? Commented Aug 20, 2015 at 9:24
  • Try this: data = pd.read_csv('myfile.csv', header=None, names=['first','second'], converters = {'first': np.float64, 'second': np.array}) The problem with your code is that your file does not have any header names so your converters will not find a match Commented Aug 20, 2015 at 9:35
  • Sorry! I have corrected the original question. My file did have column names as the first line. Commented Aug 20, 2015 at 9:45
  • Well, you have inconsistent separators: some lines have spaces, others don't, as well as commas. Is this correct? Commented Aug 20, 2015 at 9:50

1 Answer

With your current csv file the 'third' column comes in as a string, instead of a list.
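As a quick check of why the converter in the question does not help, here is a minimal sketch, assuming the cell text reaches the converter exactly as it appears in the file. Calling np.array on a string does not parse the numbers; it wraps the whole string in a zero-dimensional array.

```python
import numpy as np

# Raw text of one cell, as assumed from the sample rows in the question
cell = "[19 234 376]"

# np.array does not parse the string; it stores it as a single string scalar
arr = np.array(cell)

print(arr.ndim)        # number of dimensions of the result
print(arr.dtype.kind)  # 'U' means a unicode string dtype
```

So the values still have to be parsed explicitly before they can be concatenated.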

There might be nicer ways to convert to a list, but here goes...

from ast import literal_eval

import numpy as np
import pandas as pd

data = pd.read_csv('test_groupby.csv')

# Convert to a string representation of a list...
data['third'] = data['third'].str.replace(' ', ',')

# Convert string to list...
data['third'] = data['third'].apply(literal_eval)

group_data=data.groupby('first')

# Two secrets here revealed
# x.values instead of x since x is a Series
# list(...) to return an aggregated value
#     (np.array should work here, but...?)
ans = group_data.aggregate(
      {'third': lambda x: list(np.unique(
                               np.concatenate(x.values)))})

print(ans)
                    third
first                    
162          [1, 2, 3, 4]
170    [19, 23, 234, 376]
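An alternative that may be worth trying is to parse the column while reading the file, so no string replacement is needed afterwards. The sketch below is not the approach above; it assumes every cell in 'third' is a bracketed, whitespace-separated list of integers, and parse_int_list is a hypothetical helper name introduced only for illustration. The sample rows are inlined with io.StringIO so the example runs without a file.

```python
import io

import numpy as np
import pandas as pd

# Sample rows copied from the question, inlined so the sketch is self-contained
csv_text = """first,second,third
170.0,2,[19 234 376]
170.0,3,[19 23 23]
162.0,4,[1 2 3]
162.0,5,[1 3 4]
"""


def parse_int_list(cell):
    """Hypothetical helper: turn '[19 234 376]' into [19, 234, 376]."""
    # Assumes the cell is a bracketed list of integers separated by whitespace
    return [int(token) for token in cell.strip("[]").split()]


# The converter runs on the raw text of each cell in the 'third' column
data = pd.read_csv(io.StringIO(csv_text), converters={"third": parse_int_list})

# For each group, join all lists and keep the sorted unique values
union = data.groupby("first")["third"].apply(
    lambda s: np.unique(np.concatenate(s.tolist())).tolist()
)

print(union)
```

On the sample rows this gives [1, 2, 3, 4] for 162.0 and [19, 23, 234, 376] for 170.0, matching the result requested in the question. Because split() with no argument splits on any run of whitespace, this version also tolerates repeated spaces between numbers.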
1 Comment

Thank you very much! That works well! Thanks for literal_eval, which is the key!
