
I am new to python, and need some advice on the following. I have a file with several fields, example below

# with duplicates
name1 14019 3 0.5564 0.0929 0.6494
name1 14022 0 0.5557 0.0990 0.6547
name1 14016 0 0.5511 0.0984 0.6495
name2 11 8 0.5119 0.0938 0.6057
name2 12 18 0.5331 0.0876 0.6206
name3 16 20 0.5172 0.0875 0.6047
name3 17 29 0.5441 0.0657 0.6098
# without duplicates
name1 14022 0 0.5557 0.0990 0.6547
name2 12 18 0.5331 0.0876 0.6206
name3 17 29 0.5441 0.0657 0.6098

First is the name; the other fields are numerals (from a prediction). There are duplicate predictions that have the same name but different values. My task is to remove duplicates based on a comparison of the last field: the line with the MAXIMUM value in the last column should be kept.

I am stuck on the step of comparing the last fields for the duplicate entries. Should I go with a lambda, or is direct filtering possible? Are lists the right structure to use, or is it possible to do this on the fly while reading the file row by row?

Your help is greatly appreciated!

import csv

fi = open("filein.txt", "r", newline="")
fo = open("fileout.txt", "w", newline="")

reader = csv.reader(fi, delimiter=' ')
writer = csv.writer(fo, delimiter=' ')

names = set()
datum = []    # first occurrence of each name
datum2 = []   # later (duplicate) occurrences

for row in reader:
    if row[0] not in names:
        names.add(row[0])
        row_new1 = [row[0], row[3], row[4], row[5]]
        datum.append(row_new1)
        writer.writerow(row_new1)
    else:
        row_new2 = [row[0], row[3], row[4], row[5]]
        datum2.append(row_new2)
        writer.writerow(row_new2)

3 Answers


The code below may be of some use; I did it using a dictionary:

import csv

fi = open("filein.txt", "r", newline="")
reader = csv.reader(fi, delimiter=' ')

best = {}  # maps name -> fields of the row seen so far with the largest last column
for row in reader:
    if row[0] in best:
        # keep the row whose last field is larger (compare as floats, not strings)
        if float(best[row[0]][-1]) < float(row[-1]):
            best[row[0]] = row[1:]
    else:
        best[row[0]] = row[1:]
print(best)

This outputs:

{'name2': ['12', '18', '0.5331', '0.0876', '0.6206'], 'name3': ['17', '29', '0.5441', '0.0657', '0.6098'], 'name1': ['14022', '0', '0.5557', '0.0990', '0.6547']}
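Since your original script also writes a fileout.txt, here is a minimal follow-up sketch for writing the dictionary back out with csv.writer. The hard-coded `best` dict below stands in for the one built by the loop above:

```python
import csv

# stand-in for the dict built by the loop above: name -> remaining fields
best = {
    "name1": ["14022", "0", "0.5557", "0.0990", "0.6547"],
    "name2": ["12", "18", "0.5331", "0.0876", "0.6206"],
    "name3": ["17", "29", "0.5441", "0.0657", "0.6098"],
}

with open("fileout.txt", "w", newline="") as fo:
    writer = csv.writer(fo, delimiter=' ')
    for name in sorted(best):
        # prepend the name to restore the original row layout
        writer.writerow([name] + best[name])
```

Sorting the keys just makes the output order deterministic; drop `sorted()` if you want insertion order instead.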

3 Comments

Thank you! Worked like a charm! I tried a dict, but I messed up how to compare the dict values.
No problem, what did you try when you compared the values?
I made one dict for the first occurrences and another for the duplicates, and then tried to compare them to remove the unwanted values. The problem for me was looping over the dicts. Update: sorry, not dicts, lists.

itertools is your friend:

import csv
import itertools
import operator

fi = open("filein.txt", "r", newline="")
fo = open("fileout.txt", "w", newline="")

reader = csv.reader(fi, delimiter=' ')
writer = csv.writer(fo, delimiter=' ')

# unpack the data in a generator
duplicated_data = (tuple(row) for row in reader)

# group by name; note that groupby needs its input sorted by the key,
# which your sample file already is
groups = itertools.groupby(duplicated_data, key=operator.itemgetter(0))

for k, v in groups:
    # sort the group by the 5th value, largest first
    val = sorted(v, key=lambda x: float(x[5]), reverse=True)

    # output the row with the maximum last column
    # (writerow takes a sequence of fields, not a pre-joined string)
    writer.writerow(val[0])
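As a variation on the grouping idea, `max()` can pick the winning row in each group directly, without sorting. A minimal sketch, with an in-memory list standing in for the rows read from the file:

```python
import itertools
import operator

# stand-in for the rows read from filein.txt (must be grouped by name)
rows = [
    ["name1", "14019", "3", "0.5564", "0.0929", "0.6494"],
    ["name1", "14022", "0", "0.5557", "0.0990", "0.6547"],
    ["name1", "14016", "0", "0.5511", "0.0984", "0.6495"],
    ["name2", "11", "8", "0.5119", "0.0938", "0.6057"],
    ["name2", "12", "18", "0.5331", "0.0876", "0.6206"],
]

# max() scans each group once and keeps the row whose last field,
# compared as a float, is largest
result = [
    max(group, key=lambda r: float(r[5]))
    for _, group in itertools.groupby(rows, key=operator.itemgetter(0))
]
```

This does the same job as sort-then-take-first, but in a single pass per group.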



I hope I have understood your question well. Pandas is a very effective library which you could also use for simple tasks such as these. Note that your sample file is space-delimited and has no header row, so the column names must be supplied to read_csv:

import pandas as pd

data = pd.read_csv('dataset.csv', delimiter=' ',  # filein.txt in your case
                   names=['names', 'field1', 'field2', 'field3', 'field4', 'field5'])
res = pd.DataFrame(columns=('names', 'field1', 'field2', 'field3', 'field4', 'field5'))
for name in data['names'].unique():
    name_filter = data[data['names'] == name]  # filters the dataset on one name
    field5_max_filter = name_filter[name_filter['field5'] == name_filter['field5'].max()]  # keeps the row(s) with the max 'field5'
    res = pd.concat([res, field5_max_filter], ignore_index=True)  # appends to the result dataframe
res.to_csv('newdata.csv')  # writes the output to csv once, after the loop
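For reference, pandas can also express this filter without an explicit loop, using groupby with idxmax. A minimal sketch, reading the question's sample data from a string and assuming the same hypothetical column names:

```python
import io
import pandas as pd

# sample data from the question, space-delimited, no header
raw = """name1 14019 3 0.5564 0.0929 0.6494
name1 14022 0 0.5557 0.0990 0.6547
name2 11 8 0.5119 0.0938 0.6057
name2 12 18 0.5331 0.0876 0.6206
"""
cols = ['names', 'field1', 'field2', 'field3', 'field4', 'field5']
data = pd.read_csv(io.StringIO(raw), delimiter=' ', names=cols)

# idxmax returns, per name, the row index of the maximum 'field5';
# .loc then selects exactly those rows
res = data.loc[data.groupby('names')['field5'].idxmax()]
```

This keeps exactly one row per name (the first one, if several tie for the maximum), which matches the "without duplicates" example in the question.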

