
I am new to python, and need some advice on the following. I have a file with several fields, example below

# with duplicates
name1 14019 3 0.5564 0.0929 0.6494
name1 14022 0 0.5557 0.0990 0.6547
name1 14016 0 0.5511 0.0984 0.6495
name2 11 8 0.5119 0.0938 0.6057
name2 12 18 0.5331 0.0876 0.6206
name3 16 20 0.5172 0.0875 0.6047
name3 17 29 0.5441 0.0657 0.6098
# without duplicates
name1 14022 0 0.5557 0.0990 0.6547
name2 12 18 0.5331 0.0876 0.6206
name3 17 29 0.5441 0.0657 0.6098

First is the name; the other fields are numerals (from a prediction). There are duplicate predictions that have the same name but different values. My task is to remove duplicates based on a comparison of the last field: the line with the MAXIMUM value in the last column should be kept.

I am stuck on the step of comparing the last fields for the duplicate entries. Should I go with a lambda, or is direct filtering possible? Are lists the right structure to use, or is it possible to do this on the fly while reading the file row by row?

Your help is greatly appreciated!

import csv

fi = open("filein.txt", "r", newline="")
fo = open("fileout.txt", "w", newline="")

reader = csv.reader(fi, delimiter=' ')
writer = csv.writer(fo, delimiter=' ')

names = set()
datum = []    # first occurrence of each name
datum2 = []   # later (duplicate) occurrences

for row in reader:
    if row[0] not in names:
        names.add(row[0])
        row_new1 = [row[0], row[3], row[4], row[5]]
        datum.append(row_new1)
        writer.writerow(row_new1)
    else:
        row_new2 = [row[0], row[3], row[4], row[5]]
        datum2.append(row_new2)
        writer.writerow(row_new2)

3 Answers


The code below may be of some use; I did it using a dictionary:

import csv

fi = open("filein.txt", "r", newline="")
reader = csv.reader(fi, delimiter=' ')

best = {}  # maps name -> fields of the row seen so far with the largest last column
for row in reader:
    if row[0] in best:
        # keep the row whose last field is larger (compare as floats, not strings)
        if float(best[row[0]][-1]) < float(row[-1]):
            best[row[0]] = row[1:]
    else:
        best[row[0]] = row[1:]
print(best)

This outputs:

{'name2': ['12', '18', '0.5331', '0.0876', '0.6206'], 'name3': ['17', '29', '0.5441', '0.0657', '0.6098'], 'name1': ['14022', '0', '0.5557', '0.0990', '0.6547']}
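Since your original script also writes a fileout.txt, here is a minimal follow-up sketch for writing the dictionary back out with csv.writer. The hard-coded `best` dict below stands in for the one built by the loop above:

```python
import csv

# stand-in for the dict built by the loop above: name -> remaining fields
best = {
    "name1": ["14022", "0", "0.5557", "0.0990", "0.6547"],
    "name2": ["12", "18", "0.5331", "0.0876", "0.6206"],
    "name3": ["17", "29", "0.5441", "0.0657", "0.6098"],
}

with open("fileout.txt", "w", newline="") as fo:
    writer = csv.writer(fo, delimiter=' ')
    for name in sorted(best):
        # prepend the name to restore the original row layout
        writer.writerow([name] + best[name])
```

Sorting the keys just makes the output order deterministic; drop `sorted()` if you want insertion order instead.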

3 Comments

Thank you! Worked like a charm! I tried a dict, but I messed up how to compare the dict values.
No problem, what did you try when you compared the values?
I made one dict for the first occurrences and another for the duplicates, and then tried to compare them to remove the unwanted values. The problem for me was looping over the dicts. Update: sorry, not dicts, lists.

itertools is your friend:

import csv
import itertools
import operator

fi = open("filein.txt", "r", newline="")
fo = open("fileout.txt", "w", newline="")

reader = csv.reader(fi, delimiter=' ')
writer = csv.writer(fo, delimiter=' ')

# unpack the data in a generator
duplicated_data = (tuple(row) for row in reader)

# group by name; note that groupby needs its input sorted by the key,
# which your sample file already is
groups = itertools.groupby(duplicated_data, key=operator.itemgetter(0))

for k, v in groups:
    # sort the group by the 5th value, largest first
    val = sorted(v, key=lambda x: float(x[5]), reverse=True)

    # output the row with the maximum last column
    # (writerow takes a sequence of fields, not a pre-joined string)
    writer.writerow(val[0])
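As a variation on the grouping idea, `max()` can pick the winning row in each group directly, without sorting. A minimal sketch, with an in-memory list standing in for the rows read from the file:

```python
import itertools
import operator

# stand-in for the rows read from filein.txt (must be grouped by name)
rows = [
    ["name1", "14019", "3", "0.5564", "0.0929", "0.6494"],
    ["name1", "14022", "0", "0.5557", "0.0990", "0.6547"],
    ["name1", "14016", "0", "0.5511", "0.0984", "0.6495"],
    ["name2", "11", "8", "0.5119", "0.0938", "0.6057"],
    ["name2", "12", "18", "0.5331", "0.0876", "0.6206"],
]

# max() scans each group once and keeps the row whose last field,
# compared as a float, is largest
result = [
    max(group, key=lambda r: float(r[5]))
    for _, group in itertools.groupby(rows, key=operator.itemgetter(0))
]
```

This does the same job as sort-then-take-first, but in a single pass per group.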



I hope I have understood your question well. Pandas is a very effective library which you could also use for simple tasks such as these. Note that your sample file is space-delimited and has no header row, so the column names must be supplied to read_csv:

import pandas as pd

data = pd.read_csv('dataset.csv', delimiter=' ',  # filein.txt in your case
                   names=['names', 'field1', 'field2', 'field3', 'field4', 'field5'])
res = pd.DataFrame(columns=('names', 'field1', 'field2', 'field3', 'field4', 'field5'))
for name in data['names'].unique():
    name_filter = data[data['names'] == name]  # filters the dataset on one name
    field5_max_filter = name_filter[name_filter['field5'] == name_filter['field5'].max()]  # keeps the row(s) with the max 'field5'
    res = pd.concat([res, field5_max_filter], ignore_index=True)  # appends to the result dataframe
res.to_csv('newdata.csv')  # writes the output to csv once, after the loop
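For reference, pandas can also express this filter without an explicit loop, using groupby with idxmax. A minimal sketch, reading the question's sample data from a string and assuming the same hypothetical column names:

```python
import io
import pandas as pd

# sample data from the question, space-delimited, no header
raw = """name1 14019 3 0.5564 0.0929 0.6494
name1 14022 0 0.5557 0.0990 0.6547
name2 11 8 0.5119 0.0938 0.6057
name2 12 18 0.5331 0.0876 0.6206
"""
cols = ['names', 'field1', 'field2', 'field3', 'field4', 'field5']
data = pd.read_csv(io.StringIO(raw), delimiter=' ', names=cols)

# idxmax returns, per name, the row index of the maximum 'field5';
# .loc then selects exactly those rows
res = data.loc[data.groupby('names')['field5'].idxmax()]
```

This keeps exactly one row per name (the first one, if several tie for the maximum), which matches the "without duplicates" example in the question.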

