I have already asked a related question, but I am running into a problem when I execute the following code on files with over a million rows.

Code:

import numpy as np
import pandas as pd
import xlrd
import xlsxwriter

df = pd.read_excel('full-cust-data-nonconcat.xlsx')

df = df.groupby('ORDER_ID')['ASIN'].agg(','.join).reset_index()

writer = pd.ExcelWriter('PythonExport-Data.xlsx', engine='xlsxwriter')
df.to_excel(writer, sheet_name='Sheet1')
writer.save()

print df

Error:

Traceback (most recent call last):
  File "grouping-data.py", line 9, in <module>
    df  =df.groupby('ORDER_ID')['ASIN'].agg(','.join).reset_index()
  File "/Library/Python/2.7/site-packages/pandas/core/groupby.py", line 2668, in aggregate
    result = self._aggregate_named(func_or_funcs, *args, **kwargs)
  File "/Library/Python/2.7/site-packages/pandas/core/groupby.py", line 2786, in _aggregate_named
    output = func(group, *args, **kwargs)
TypeError: sequence item 0: expected string, int found

Since it's a huge file, how can I check where it expects a string but finds an int?

Is there any way I can convert the whole column to strings first?

Sample Data: (these IDs are alphanumeric)

ID1 Some_other_id1
ID2 Some_other_id2

1 Answer

You can write a lambda expression in the agg function to do the conversion:

df.groupby('ORDER_ID')['ASIN'].agg(lambda x: ','.join(x.astype(str))).reset_index()

Or convert the data type before aggregation:

df['ASIN'].astype(str).groupby(df['ORDER_ID']).agg(','.join).reset_index()
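To answer the "where is the int" part of the question, here is a minimal sketch that first lists the rows whose ASIN is not a string and then does the conversion and join. The sample frame and the name non_string_rows are made up for illustration; the real data would come from pd.read_excel. It assumes Python 3, where isinstance(v, str) covers all text values (on Python 2.7 you would check against basestring instead).

```python
import pandas as pd

# Hypothetical sample data: one ASIN came through as an integer,
# which is what makes ','.join raise the TypeError.
df = pd.DataFrame({
    'ORDER_ID': ['A1', 'A1', 'B2', 'B2'],
    'ASIN': ['B00X', 12345, 'B00Y', 'B00Z'],
})

# Rows whose ASIN is not a string; these are the offending values.
non_string_rows = df[~df['ASIN'].apply(lambda v: isinstance(v, str))]

# Convert the whole column to strings, then join the ASINs per order.
df['ASIN'] = df['ASIN'].astype(str)
result = df.groupby('ORDER_ID')['ASIN'].agg(','.join).reset_index()
```

Inspecting non_string_rows before converting shows which cells were read as numbers, which is usually enough to tell whether the conversion is safe for your data.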

2 Comments

Thanks, it worked. But when I try it on another file with one more string column, it gives this error:
File "grouping-data.py", line 11, in <module>
df = df['ASIN'].astype(str).groupby(df['ORDER_ID']).agg(','.join).reset_index()
File "/Library/Python/2.7/site-packages/pandas/core/frame.py", line 2059, in getitem
return self._getitem_column(key)
File "/Library/Python/2.7/site-packages/pandas/core/frame.py", line 2066, in _getitem_column
return self._get_item_cache(key)
File "/Library/Python/2.7/site-packages/pandas/core/generic.py", line 1386, in _get_item_cache
values = sel
I am not sure what this is. Maybe you can share some of the data that fails the command.
