I am new to Python and need some advice on the following. I have a file with several fields; an example is below:
# input, with duplicates
name1 14019 3 0.5564 0.0929 0.6494
name1 14022 0 0.5557 0.0990 0.6547
name1 14016 0 0.5511 0.0984 0.6495
name2 11 8 0.5119 0.0938 0.6057
name2 12 18 0.5331 0.0876 0.6206
name3 16 20 0.5172 0.0875 0.6047
name3 17 29 0.5441 0.0657 0.6098
# desired output, without duplicates
name1 14022 0 0.5557 0.0990 0.6547
name2 12 18 0.5331 0.0876 0.6206
name3 17 29 0.5441 0.0657 0.6098
The first field is the name; the other fields are numbers (prediction values). There are duplicate predictions that have the same name but different values. My task is to remove the duplicates by comparing the last field: for each name, the line with the MAXIMUM value in the last column should be kept.
I am stuck on the step of comparing the last fields of the duplicate entries. Should I go with a lambda, or is direct filtering possible? Are lists the right structure to use, or can this be done on the fly while reading the file row by row?
Your help is greatly appreciated!
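To make the lambda question concrete, this is roughly the kind of thing I mean (just a sketch, assuming the file is space-delimited as shown and that rows with the same name are consecutive, as they are in the example); I don't know whether it is the right direction:

import csv
from itertools import groupby

# sketch only: assumes rows for the same name are consecutive in the file
with open("filein.txt", newline="") as fi:
    reader = csv.reader(fi, delimiter=' ')
    for name, rows in groupby(reader, key=lambda r: r[0]):
        # pick the row with the largest value in the last column
        best = max(rows, key=lambda r: float(r[-1]))
        print(best)

And here is my actual attempt so far, which does not work yet: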
import csv
fi = open("filein.txt", newline="")
fo = open("fileout.txt", "w", newline="")
reader = csv.reader(fi, delimiter=' ')
writer = csv.writer(fo, delimiter=' ')
names = set()
datum = []   # first row seen for each name
datum2 = []  # later (duplicate) rows
for row in reader:
    if row[0] not in names:
        names.add(row[0])
        row_new1 = [row[0], row[3], row[4], row[5]]
        datum.append(row_new1)
        writer.writerow(row_new1)
    else:
        row_new2 = [row[0], row[3], row[4], row[5]]
        datum2.append(row_new2)
        # stuck here: how do I compare row[5] with the value already
        # written for this name, and keep only the larger one?
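For comparison, here is a minimal sketch of the other idea, doing it on the fly while reading row by row: keep a dict mapping each name to the row with the largest last-column value seen so far. This assumes the set of names fits in memory and that the file is space-delimited as shown; for simplicity it keeps whole rows rather than a subset of the columns:

import csv

best = {}  # name -> row with the largest last-column value seen so far

with open("filein.txt", newline="") as fi:
    reader = csv.reader(fi, delimiter=' ')
    for row in reader:
        name = row[0]
        # keep this row only if it beats the best one seen so far for this name
        if name not in best or float(row[-1]) > float(best[name][-1]):
            best[name] = row

with open("fileout.txt", "w", newline="") as fo:
    writer = csv.writer(fo, delimiter=' ')
    for row in best.values():
        writer.writerow(row)

Is one of these approaches preferable, or is there a more idiomatic way?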