I have seen lots of questions and answers on this, but none of the ones I have looked at solved my problem, so any help would be appreciated.

I have a very large CSV file in which the same first-column value appears on more than one row, and I would like a script to match and merge those rows based on the 1st column. (I do not want to use pandas. I am using Python 2.7. The file has no CSV header row.)

This is the input:

2144, 2016, 505, 20005, 2007, PP, GPP, DAC, UNSW 
8432, 2015, 505, 20005, 2041, LL, GLO, X2, UNSW
0055, 0.00, 0.00, 2014, 2017
2144, 0.00, 0.00, 2016, 959
8432, 22.9, 0.00, 2015, 2018 
0055, 2014, 505, 20004, 2037, LL, GLO, X2, QAL

Wanted output:

2144, 0.00, 0.00, 2016, 959, 2016, 505, 20005, 2007, PP, GPP, DAC, UNSW  
0055, 0.00, 0.00, 2014, 2017, 2014, 505, 20004, 2037, LL, GLO, X2, QAL   
8432, 22.9, 0.00, 2015, 2018, 2015, 505, 20005, 2041, LL, GLO, X2, UNSW

I have tried:

import csv

reader = csv.reader(open('input.csv'))
result = {}

for row in reader:
    idx = row[0]
    values = row[1:]
    if idx in result:
        result[idx] = [result[idx][i] or v for i, v in enumerate(values)]
    else:
        result[idx] = values

and this to search for duplicates:

with open('1.csv','r') as in_file, open('2.csv','w') as out_file:
    seen = set()  # set for fast O(1) amortized lookup
    for line in in_file:
        if line in seen:
            continue  # skip lines that are exact duplicates
        seen.add(line)
        out_file.write(line)

But these haven't helped me, and I'm lost.

Any help would be great.

Thanks

  • Thanks Piinthesky. I have edited the question above. I am lost and not sure where to start. (Commented Feb 13, 2018 at 23:26)

1 Answer

Try using a dictionary, with the value of the 1st column as your key. Here's how I would do it:

import csv

with open('myfile.csv') as csvfile:
    # skipinitialspace=True removes the spaces after the commas.
    # Note: list() loads every row into memory at once.
    reader = list(csv.reader(csvfile, skipinitialspace=True))
    result = {}  # or collections.OrderedDict() if the output order is important
    for row in reader:
        if not row:  # blank lines parse as empty rows; row[0] would raise IndexError
            continue
        if row[0] in result:
            result[row[0]].extend(row[1:])  # do not include the key again
        else:
            result[row[0]] = row

    # result.values() holds the merged rows, for example:
    for row in result.values():
        print(', '.join(row))
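
The wanted output in the question appears to list the shorter (five-field) row first for every key, whereas the loop above appends fields in the order the rows occur in the file. Here is a minimal sketch of one way to get that ordering, assuming that "shorter row first" really is the rule (the question does not state it). The function name `merge_short_first` and the sample `lines` are hypothetical and used only for illustration:

```python
import csv
from collections import OrderedDict


def merge_short_first(lines):
    """Group rows by their first column; fields of shorter rows come first."""
    groups = OrderedDict()  # key -> list of value lists, keys in first-seen order
    for row in csv.reader(lines, skipinitialspace=True):
        if not row:  # blank lines parse as empty rows
            continue
        # strip() also removes trailing spaces such as in "UNSW "
        groups.setdefault(row[0], []).append([f.strip() for f in row[1:]])
    merged = []
    for key, parts in groups.items():
        parts.sort(key=len)  # assumption: the shorter row's fields go first
        merged.append([key] + [f for part in parts for f in part])
    return merged


# Four of the sample rows from the question
lines = [
    "2144, 2016, 505, 20005, 2007, PP, GPP, DAC, UNSW",
    "0055, 0.00, 0.00, 2014, 2017",
    "2144, 0.00, 0.00, 2016, 959",
    "0055, 2014, 505, 20004, 2037, LL, GLO, X2, QAL",
]
for row in merge_short_first(lines):
    print(", ".join(row))
# 2144, 0.00, 0.00, 2016, 959, 2016, 505, 20005, 2007, PP, GPP, DAC, UNSW
# 0055, 0.00, 0.00, 2014, 2017, 2014, 505, 20004, 2037, LL, GLO, X2, QAL
```

`csv.reader` accepts any iterable of strings, so an open file object can be passed in place of the list.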

5 Comments

Thank you. I am hoping this will work, but I got the following error: "if row[0] in result: IndexError: list index out of range". I'm not sure why. Any ideas? Thanks again.
I think changing it to reader = list(csv.reader(csvfile, skipinitialspace=True)) should work.
Thank you. Unfortunately, it now runs for some time and then fails with a memory error.
What's the error and how large is the file? I'm guessing the file is too large to fit in memory. If that's the case, you'll need to follow similar steps but in chunks, writing the new output to a file.
The error is "reader = list(csv.reader(csvfile, skipinitialspace=True)) MemoryError". Yes, the file is 1,411,035 KB. Should I do something like this: chunk, chunksize = [], 100 def process_chunk(chunk): print len(chunk)? Thanks.
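
The two errors reported in these comments can be addressed in part by iterating over the reader lazily and skipping blank lines. The sketch below is one possible shape for that, not a confirmed fix for this file. It avoids the full copy of the rows that `list()` creates, but the merged dictionary itself still has to fit in memory, which may not hold for a file of this size. The name `merge_stream` is hypothetical:

```python
import csv
from collections import OrderedDict


def merge_stream(in_lines, out_file):
    """Merge rows that share a first column, reading the input lazily."""
    result = OrderedDict()
    # Iterate over the reader directly; wrapping it in list() would hold
    # every parsed row in memory in addition to the merged result.
    for row in csv.reader(in_lines, skipinitialspace=True):
        if not row:  # blank lines give empty rows, so row[0] would fail
            continue
        if row[0] in result:
            result[row[0]].extend(row[1:])  # do not repeat the key
        else:
            result[row[0]] = row
    writer = csv.writer(out_file)
    for merged in result.values():
        writer.writerow(merged)
    return len(result)  # number of distinct keys written
```

To use it with files, pass two open file objects, for example the input from the answer and a hypothetical output file `merged.csv`. On Python 2.7 the csv module expects files opened in binary mode ('rb' and 'wb'), while on Python 3 they should be opened in text mode with newline=''.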
