Python script to merge rows based on 1st column

Question

I have seen many questions and answers on this topic, but none of the ones I have looked at solved my problem, so any help would be appreciated.

I have a very large CSV file in which some first-column values appear on more than one row. I would like a script that matches rows on the 1st column and merges them into a single row. (I do not want to use pandas. I am using Python 2.7. The file has no CSV header row.)

This is the input:

2144, 2016, 505, 20005, 2007, PP, GPP, DAC, UNSW 
8432, 2015, 505, 20005, 2041, LL, GLO, X2, UNSW
0055, 0.00, 0.00, 2014, 2017
2144, 0.00, 0.00, 2016, 959
8432, 22.9, 0.00, 2015, 2018 
0055, 2014, 505, 20004, 2037, LL, GLO, X2, QAL

Wanted output:

2144, 0.00, 0.00, 2016, 959, 2016, 505, 20005, 2007, PP, GPP, DAC, UNSW  
0055, 0.00, 0.00, 2014, 2017, 2014, 505, 20004, 2037, LL, GLO, X2, QAL   
8432, 22.9, 0.00, 2015, 2018, 2015, 505, 20005, 2041, LL, GLO, X2, UNSW

I have tried:

import csv

reader = csv.reader(open('input.csv'))
result = {}

for row in reader:
    idx = row[0]
    values = row[1:]
    if idx in result:
        result[idx] = [result[idx][i] or v for i, v in enumerate(values)]
    else:
        result[idx] = values

and this to search for duplicates:

with open('1.csv', 'r') as in_file, open('2.csv', 'w') as out_file:
    seen = set()  # set for fast O(1) amortized lookup
    for line in in_file:
        if line in seen:
            continue  # skip lines that are exact duplicates
        seen.add(line)
        out_file.write(line)

But these have not helped me. I'm lost.

Any help would be great.

Thanks

Thanks Piinthesky. I have edited the question above. I am lost and not sure where to start.
– Pbree, Commented Feb 13, 2018 at 23:26

ZaxR · Accepted Answer · 2018-02-14 04:23:20Z

1

Try using a dictionary, with the value of the 1st column as your key. Here's how I would do it:

import csv

with open('myfile.csv') as csvfile:
    reader = list(csv.reader(csvfile, skipinitialspace=True))  # skipinitialspace drops the spaces after the commas
    result = {}  # or collections.OrderedDict() if the output order is important
    for row in reader:
        if not row:  # blank lines are parsed as empty lists and would raise IndexError below
            continue
        if row[0] in result:
            result[row[0]].extend(row[1:])  # do not include the key again
        else:
            result[row[0]] = row

    # result.values() holds the merged rows, for example:
    for row in result.values():
        print(', '.join(row))
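
A minimal, self-contained sketch of the same dictionary idea, run on the sample rows from the question. The function name merge_rows and the short_first flag are hypothetical, introduced only for illustration. The flag assumes that the wanted output always places the shorter row's fields first, which is a guess based on the sample and not something the question states.

```python
import csv
from collections import OrderedDict


def merge_rows(lines, short_first=False):
    """Merge CSV rows sharing the same first column (hypothetical helper).

    Keys keep their first-seen order. With short_first=True, the fields of
    shorter rows are placed before those of longer rows (an assumption
    inferred from the sample output in the question).
    """
    groups = OrderedDict()
    for row in csv.reader(lines, skipinitialspace=True):
        if not row:  # blank lines are parsed as empty lists
            continue
        fields = [f.strip() for f in row]  # also drop trailing spaces
        groups.setdefault(fields[0], []).append(fields[1:])

    merged = []
    for key, parts in groups.items():
        if short_first:
            parts = sorted(parts, key=len)  # stable sort keeps ties in file order
        merged.append([key] + [f for part in parts for f in part])
    return merged


sample = [
    "2144, 2016, 505, 20005, 2007, PP, GPP, DAC, UNSW ",
    "8432, 2015, 505, 20005, 2041, LL, GLO, X2, UNSW",
    "0055, 0.00, 0.00, 2014, 2017",
    "2144, 0.00, 0.00, 2016, 959",
    "8432, 22.9, 0.00, 2015, 2018 ",
    "0055, 2014, 505, 20004, 2037, LL, GLO, X2, QAL",
]

for row in merge_rows(sample, short_first=True):
    print(", ".join(row))
```

Without short_first, fields are appended in the order the rows appear in the file, so keys whose longer row comes first (2144 and 8432 in the sample) would not match the field order of the wanted output.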

edited Feb 14, 2018 at 4:23

ZaxR


answered Feb 13, 2018 at 23:35

Manur



5 Comments

Pbree Over a year ago

Thank you. I am hoping this will work. I got the following error: "if row[0] in result: IndexError: list index out of range". I'm not sure why. Any ideas? Thanks again.

ZaxR Over a year ago

I think making it reader = list(csv.reader(csvfile, skipinitialspace=True)) should work.

Pbree Over a year ago

Thank you. Unfortunately, it now runs for some time and then fails with a memory error.

ZaxR Over a year ago

What's the error and how large is the file? I'm guessing the file is too large to fit in memory. If that's the case, you'll need to follow similar steps but in chunks, writing the new output to a file.

Pbree Over a year ago

The error is "reader = list(csv.reader(csvfile, skipinitialspace=True)) MemoryError". Yes, the file is 1,411,035 KB. Would it be something like this: chunk, chunksize = [], 100 and def process_chunk(chunk): print len(chunk)? Thanks.
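
One way the "in chunks, writing the new output to a file" suggestion could be sketched is to first split the input into smaller bucket files by a hash of the 1st column, then merge each bucket in memory. This is an illustration under stated assumptions, not the approach the commenters settled on: the function name merge_large_csv, the n_buckets parameter, and the file names in the usage comment are all hypothetical. It also assumes that each bucket fits in memory and that the key contains no quoted commas. It is written to run on Python 3 and should also run on Python 2.7 as long as the keys are plain ASCII.

```python
import csv
import os
import tempfile
import zlib
from collections import OrderedDict


def merge_large_csv(in_path, out_path, n_buckets=16):
    """Merge rows by first column without loading the whole file (sketch)."""
    tmp_dir = tempfile.mkdtemp()
    bucket_paths = [os.path.join(tmp_dir, "bucket_%d.csv" % i)
                    for i in range(n_buckets)]

    # Pass 1: stream the input and route each line to a bucket file chosen
    # by a hash of its key, so all rows with the same key share a bucket.
    buckets = [open(p, "w") for p in bucket_paths]
    try:
        with open(in_path) as src:
            for line in src:
                key = line.split(",", 1)[0].strip()
                if not key:  # skip blank lines
                    continue
                idx = zlib.crc32(key.encode("utf-8")) % n_buckets
                buckets[idx].write(line if line.endswith("\n") else line + "\n")
    finally:
        for b in buckets:
            b.close()

    # Pass 2: each bucket is small enough to merge with a dictionary.
    with open(out_path, "w") as out:
        for path in bucket_paths:
            merged = OrderedDict()
            with open(path) as part:
                for row in csv.reader(part, skipinitialspace=True):
                    if not row:
                        continue
                    values = [v.strip() for v in row[1:]]
                    merged.setdefault(row[0].strip(), []).extend(values)
            for key, values in merged.items():
                out.write(", ".join([key] + values) + "\n")
            os.remove(path)
    os.rmdir(tmp_dir)


# Hypothetical usage:
# merge_large_csv('myfile.csv', 'merged.csv', n_buckets=64)
```

Rows with the same key keep their relative file order, but the keys themselves come out grouped by bucket, not in their original order. Raising n_buckets lowers the memory needed per bucket at the cost of more open files during the first pass.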
