1

I have a big file that I open and parse in Python like this:

 import csv
 fh_in = open('/xzy/abc', 'r')
 parsed_in = csv.reader(fh_in, delimiter=',')
 for element in parsed_in:
     print(element)

RESULT:

['ABC', 'chr9', '3468582', 'NAME1', 'UGA', 'GGU']

['DEF', 'chr9', '14855289', 'NAME19', 'UCG', 'GUC']

['TTC', 'chr9', '793946', 'NAME178', 'CAG', 'GUC']

['ABC', 'chr9', '3468582', 'NAME272', 'UGT', 'GCU']

I need to keep only the unique entries, removing any entry whose values in col1, col2 and col3 match those of an earlier entry. In this case, the last line is a duplicate of line 1 on the basis of col1, col2 and col3.

I have tried two methods, but both failed:

Method 1:

outlist=[]

for element in parsed_in:     
  if element[0:3] not in outlist[0:3]:
    outlist.append(element)

Method 2:

outlist=[]
parsed_list=list(parsed_in)
for element in range(0,len(parsed_list)):
  if parsed_list[element] not in parsed_list[element+1:]:
    outlist.append(parsed_list[element])

Both of these return all the entries instead of only the entries that are unique on the first 3 columns.

Please suggest a way to do this.

AK

  • Possible duplicate of How do you remove duplicates from a list in Python? Commented Mar 1, 2012 at 20:52
  • Not a duplicate, as the list has to be unique based on only part of each row and not the whole row. Commented Mar 1, 2012 at 20:55

2 Answers

3

You probably want an O(1) lookup so that you avoid a full scan of the already-added elements every time you add a row, and as Caol Acain said, a set is a good way to do it.

What you want to do is something like:

outlist=[]
added_keys = set()

for row in parsed_in:
    # We use tuples because they are hashable
    lookup = tuple(row[:3])    
    if lookup not in added_keys:
        outlist.append(row)
        added_keys.add(lookup)

You could alternatively use a dictionary mapping the key to the row, with the caveat that on Python versions before 3.7 a plain dict does not guarantee that it preserves the ordering of the input. Keeping the list alongside the key set lets you retain the in-file ordering on any version.
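For comparison, here is a minimal sketch of the dictionary alternative mentioned above. It assumes Python 3.7 or later, where dicts keep insertion order, and it uses the four rows from the question through `io.StringIO` in place of the real file. The helper name `unique_by_key` and its `key_len` parameter are made up for illustration.

```python
import csv
import io

# Sample rows taken from the question; io.StringIO stands in for the real file.
sample = io.StringIO(
    "ABC,chr9,3468582,NAME1,UGA,GGU\n"
    "DEF,chr9,14855289,NAME19,UCG,GUC\n"
    "TTC,chr9,793946,NAME178,CAG,GUC\n"
    "ABC,chr9,3468582,NAME272,UGT,GCU\n"
)


def unique_by_key(rows, key_len=3):
    """Keep the first row seen for each key built from the leading columns."""
    first_seen = {}
    for row in rows:
        # setdefault stores the row only if this key is not present yet.
        first_seen.setdefault(tuple(row[:key_len]), row)
    # Assumption: dicts preserve insertion order (guaranteed from Python 3.7).
    return list(first_seen.values())


outlist = unique_by_key(csv.reader(sample, delimiter=','))
for row in outlist:
    print(row)
```

With the sample data this keeps the NAME1, NAME19 and NAME178 rows and drops the NAME272 row, because its first three columns repeat those of the first row.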


1 Comment

First good answer, much better than the one I was going to post. +1
0

Convert your lists to sets!

http://docs.python.org/tutorial/datastructures.html#sets

1 Comment

I thought of this first as well, but if you read the problem more closely you will see that converting to sets won't work on its own. The rows need to be unique only on the first three elements of each sublist.
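A small sketch of why a direct conversion falls short, using two rows from the question (the variable names are only for illustration): lists cannot go into a set at all, and whole-row tuples keep both rows because the rows differ in the later columns.

```python
rows = [
    ['ABC', 'chr9', '3468582', 'NAME1', 'UGA', 'GGU'],
    ['ABC', 'chr9', '3468582', 'NAME272', 'UGT', 'GCU'],
]

# Lists are unhashable, so they cannot be placed in a set directly.
try:
    set(rows)
    list_error = None
except TypeError as exc:
    list_error = exc

# Whole-row tuples are hashable, but both rows survive because they
# differ in columns 4 to 6.
whole_row_set = {tuple(row) for row in rows}

# Only a key built from the first three columns treats them as duplicates.
key_set = {tuple(row[:3]) for row in rows}

print(type(list_error).__name__, len(whole_row_set), len(key_set))
# prints: TypeError 2 1
```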
