
I apologize for the confusing title, but this is kind of a confusing question.

I have a CSV file with multiple columns, like in this example:

header_a | header_b | header_c | header_d
  abc         1         data1      data2
  abc         1         data3      data4
  abc         2         data5      data6
  abc         2         data7      data8
  abc         3         data9      data10

I need a script that would be able to transform this data to the following format:

header_a | header_b | header_c | header_d
  abc         1         data1      data2    data3      data4      
  abc         2         data5      data6    data7      data8      
  abc         3         data9      data10

I do not care about the header as much, since there could be multiple entries. In short, whenever the values in header_b match, I need all the values after it in the row to be appended to the first instance of it in the data frame.

I have a skeleton of how I would approach the problem, but I am stuck:

dd.sort_values('Purchase Order #', inplace=True)
values = dd['Purchase Order #'].unique().tolist()

for x in values:
    header_flag = False
    for row in dd['Purchase Order #']:
        if x == row:
            if header_flag == False:
                #This is the first purchase order, copy entire line
                print(row.tolist())
                #set the flag to True
                header_flag = True
            else:
                #We already have the first header, only copy next 5
                print('Else Block')
        else:
            #Do nothing
            print('False')

The first 2 lines sort the dataframe by the value that needs to match and pull a list of the unique values in it. Is pandas perhaps not suited for this?

2 Answers

1

I haven't worked with pandas, but this can be done without it, assuming the headers and the first column ('abc') are static. I'll leave out the headers for simplicity, since you only care about combining the data.

My approach is to use header_b's value as a dictionary key and collect the rest of each row in a list of values.

>>> import csv
>>> header_b = {}
>>> with open('testfiles/test.csv') as csvfile:
...     next(csvfile)  # Skip the header line
...     reader = csv.reader(csvfile)
...     for row in reader:
...         header_b.setdefault(row[1], [])  # Add the header_b key if it is not in the dictionary yet
...         data = [row[0], row[2], row[3]]  # First row for this key: header_a plus the data columns
...         if row[0] in header_b[row[1]]:
...             data = [row[2], row[3]]  # header_a is already stored, so append only the data columns
...         header_b[row[1]].extend(data)  # Or header_b[row[1]] += data
...
>>> for key, values in header_b.items():
...     string = ' '.join(values[1:])
...     print(values[0], key, string)
...

abc 2 data5 data6 data7 data8
abc 1 data1 data2 data3 data4
abc 3 data9 data10

The output is not ordered because dictionaries don't preserve insertion order on Python versions before 3.7. You can use OrderedDict if you want it sorted by keys.

>>> from collections import OrderedDict
>>> sorted_keys = OrderedDict(sorted(header_b.items()))  # Sort by the header_b key
>>> for key, values in sorted_keys.items():
...     string = ' '.join(values[1:])
...     print(values[0], key, string)
...

abc 1 data1 data2 data3 data4
abc 2 data5 data6 data7 data8
abc 3 data9 data10
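
If header_a is not constant, one variant of the same idea is to key only on header_b, keep the first matching row whole, and append just the columns after header_b from later matches. The following is a minimal sketch, not a tested solution for your real file: it assumes a comma-delimited file with a header row, and both merge_rows and the in-memory sample data are made up for illustration.

```python
import csv
import io


def merge_rows(lines, key_index=1):
    """Group CSV rows by the column at key_index.

    The first row seen for a key is kept whole; for every later row with
    the same key, only the columns after the key are appended to it.
    """
    reader = csv.reader(lines)
    header = next(reader)  # Assumption: the first line is a header row
    merged = {}  # key -> combined row (insertion-ordered on Python 3.7+)
    for row in reader:
        key = row[key_index]
        if key not in merged:
            merged[key] = list(row)
        else:
            merged[key].extend(row[key_index + 1:])
    return header, list(merged.values())


# Hypothetical sample in which header_a is NOT the same on every row
sample = io.StringIO(
    "header_a,header_b,header_c,header_d\n"
    "abc,1,data1,data2\n"
    "def,1,data3,data4\n"
    "abc,2,data5,data6\n"
    "abc,2,data7,data8\n"
    "abc,3,data9,data10\n"
)

header, rows = merge_rows(sample)
for row in rows:
    print(" ".join(row))
```

For a real file, an open file object can be passed in place of the StringIO sample. Note that the header_a value of later matches ('def' here) is dropped, which is what "append everything after header_b to the first instance" implies.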

2 Comments

Hey, thanks for the reply! Unfortunately the first column (or any other column) is not static; only header_b can have duplicate values.
I've updated the code. By static, I meant header_a will always have the value 'abc'. If that's not the case, the code above will not work, because it doesn't know what to do if row 2 becomes def, 1, data3, data4.
0

Groupby should get you where you need to be. If the data types are strings, sum() concatenates them within each group, so you can one-line this as:

grp_sum = df.groupby('header_b').sum()

This won't add new columns of course, but if you have standard string patterns, you can split the columns. In your example,

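def splitter(x):
    # Assumes every value is exactly 5 characters long (e.g. 'data1'),
    # so 'data1data3' splits into ('data1', 'data3')
    return (x[:5], x[5:])

# Transpose the per-row tuples into one tuple per new column
split_cols = list(zip(*grp_sum['header_c'].apply(splitter)))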
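
When the columns hold numbers, sum() adds them up instead of concatenating, so the trick above no longer applies. One way around that is to avoid aggregation entirely and flatten each group's values into a single row. This is only a sketch under assumptions: the numeric sample data and the value_cols list are invented for illustration, and the column names follow the question's example.

```python
import pandas as pd

# Hypothetical numeric data shaped like the question's example
df = pd.DataFrame({
    "header_a": ["abc", "abc", "abc", "abc", "abc"],
    "header_b": [1, 1, 2, 2, 3],
    "header_c": [10, 30, 50, 70, 90],
    "header_d": [20, 40, 60, 80, 100],
})

value_cols = ["header_c", "header_d"]  # Columns to append for repeated keys

records = []
for key, group in df.groupby("header_b", sort=False):
    first = group.iloc[0]
    # Flatten the group's value columns row by row: c0, d0, c1, d1, ...
    flat = group[value_cols].to_numpy().ravel().tolist()
    records.append([first["header_a"], key] + flat)

# Rows of unequal length are padded with NaN
wide = pd.DataFrame(records)
print(wide)
```

Because shorter rows are padded with NaN, the extra columns come back as floats; the result has no meaningful column names, which matches the question's remark that the header matters less.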

1 Comment

Hey there, thanks for the reply! Unfortunately the data in question is numeric, so pandas is adding it all up!
