2

I'm trying to create code that will take data form certain columns in a CSV file and combine them into a new CSV file. I was directed to use Pandas but I'm not sure if I'm even on the right track. I'm fairly new to Python so prepare yourselves for potentially awful code.

I want to use data.csv:

Customer_ID,Date,Time,OtherColumns,A,B,C,Cost
1003,January,2:00,Stuff,1,5,2,519
1003,January,2:00,Stuff,1,3,2,530
1003,January,2:00,Stuff,1,3,2,530
1004,Feb,2:00,Stuff,1,1,0,699

and create a new CSV that looks like this:

Customer_ID,ABC
1003,152
1003,132
1003,132
1004,110

What I have so far is:

import csv
import pandas as pd

df = pd.read_csv('test.csv', delimiter = ',')
custID = df.customer_ID
choiceA = df.A
choiceB = df.B
choiceC = df.C

ofile  = open('answer.csv', "wb")
writer = csv.writer(ofile, delimiter = ',')
writer.writerow(custID + choiceA + choiceB + choiceC)

Unfortunately all that does is add each row together, then create a CSV of each row summed together as one row. My true end goal would be to find the most occurring values in columns A-C and combine each customer into the same row, using the most occurring values. I'm awful at explaining. I'd want something that takes data.csv and makes this:

Customer_ID,ABC
1003,132
1004,110
2
  • "the most occurring values"? What do you want to happen if there are two ID/ABC pairs with the same number of occurrences? (E.g. 1003, 132 and 1003, 142, say.) Commented Mar 2, 2014 at 1:02
  • I don't particularly care which is chosen for now but I'd like to know for future reference how to manipulate which is chosen based on other calculations. Perhaps if the sale was during the first half of the year, it chooses the lower value, but if it's the second half of the year it chooses the higher. I'm still learning Python as I said so I greatly appreciate your help Commented Mar 2, 2014 at 1:09

1 Answer 1

3

You can sum the columns you're interested in, if their type is string:

In [11]: df = pd.read_csv('data.csv', index_col='Customer_ID')

In [12]: df
Out[12]:
                Date  Time OtherColumns  A  B  C  Cost
Customer_ID
1003         January  2:00        Stuff  1  5  2   519
1003         January  2:00        Stuff  1  3  2   530
1003         January  2:00        Stuff  1  3  2   530
1004             Feb  2:00        Stuff  1  1  0   699

In [13]: res = df[list('ABC')].astype(str).sum(1)  # cols = list('ABC')

In [14]: res
Out[14]:
Customer_ID
1003           152
1003           132
1003           132
1004           110
dtype: float64

To get the csv, you can first use to_frame to add the desired column name:

In [15]: res.to_frame(name='ABC')  # ''.join(cols)
Out[15]:
             ABC
Customer_ID
1003         152
1003         132
1003         132
1004         110

In [16]: res.to_frame(name='ABC').to_csv('new.csv')
Sign up to request clarification or add additional context in comments.

3 Comments

Amazing that it can be done in only 3 lines, I was clearly over-thinking it, thank you so much. Do you know of a way to code: if customer_ID = customer_ID on the last line then find the mode of all the A's, all the B's, and all the C's and return on row for each customer?
Not sure what asking, one part perhaps is res.groupby(level=0).last(). Best to ask new question explicitly in the scope of DataFrames rather than csvs! :)
I will look into it more, thank you! I posted a new question about it, but while I wait for a response I will look into res.groupby

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.