I have seen lots of questions and answers on this, but none of the ones I have looked at have solved my problem, so any help would be appreciated.
I have a very large CSV file with some duplicated entries in the first column, and I would like a script to match and merge the rows based on that column. (I do not want to use pandas. I am using Python 2.7, and there are no CSV headers in the file.)
This is the input:
2144, 2016, 505, 20005, 2007, PP, GPP, DAC, UNSW
8432, 2015, 505, 20005, 2041, LL, GLO, X2, UNSW
0055, 0.00, 0.00, 2014, 2017
2144, 0.00, 0.00, 2016, 959
8432, 22.9, 0.00, 2015, 2018
0055, 2014, 505, 20004, 2037, LL, GLO, X2, QAL
Wanted output:
2144, 0.00, 0.00, 2016, 959, 2016, 505, 20005, 2007, PP, GPP, DAC, UNSW
0055, 0.00, 0.00, 2014, 2017, 2014, 505, 20004, 2037, LL, GLO, X2, QAL
8432, 22.9, 0.00, 2015, 2018, 2015, 505, 20005, 2041, LL, GLO, X2, UNSW
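To make the requirement concrete: the transformation amounts to grouping rows by the first field and concatenating the remaining fields of each group, with the shorter row's values first, as in the sample above. A minimal sketch of that idea (the inline `data` string stands in for `input.csv`; with the real file you would iterate over its lines instead):

```python
# Sample input inlined for illustration; replace with the real file's lines.
data = """\
2144, 2016, 505, 20005, 2007, PP, GPP, DAC, UNSW
8432, 2015, 505, 20005, 2041, LL, GLO, X2, UNSW
0055, 0.00, 0.00, 2014, 2017
2144, 0.00, 0.00, 2016, 959
8432, 22.9, 0.00, 2015, 2018
0055, 2014, 505, 20004, 2037, LL, GLO, X2, QAL
"""

merged = {}
for line in data.splitlines():
    row = [field.strip() for field in line.split(',')]
    key, values = row[0], row[1:]
    if key not in merged:
        merged[key] = values
    elif len(values) < len(merged[key]):
        # Put the shorter row's values first, matching the wanted output.
        merged[key] = values + merged[key]
    else:
        merged[key] = merged[key] + values

for key, values in sorted(merged.items()):
    print(', '.join([key] + values))
```

This is only a sketch under the assumption that each key appears in exactly one "short" and one "long" row; if a key can repeat more than twice, the concatenation order would need a firmer rule.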
I have tried:

import csv

reader = csv.reader(open('input.csv'))
result = {}
for row in reader:
    idx = row[0]
    values = row[1:]
    if idx in result:
        result[idx] = [result[idx][i] or v for i, v in enumerate(values)]
    else:
        result[idx] = values
and this to search for duplicates:

with open('1.csv', 'r') as in_file, open('2.csv', 'w') as out_file:
    seen = set()  # set for fast O(1) amortized lookup
    for line in in_file:
        if line in seen:
            continue
But these haven't helped me, and I'm lost.
Any help would be great.
Thanks