How to make an efficient filter in Python

Question

I have a problem with two very large files (each with more than 1,000,000 entries) in Python: I need to build a filter and I don't know how. The first file looks like this:

1,2,3
2,4,5
3,3,4

and the second one like this:

1,"fege"
2,"greger"
4,"feffg"

The first item of each row is always the ID. Now I want to filter the first file so that it only contains rows whose IDs also appear in the second file. For this example the result should be:

1,2,3
2,4,5

How can I do this quickly? The core problem is that each list is very long. I used something like this:

[row for row in myRows if row[0] == item[0]]

but this takes a very long time to run through (more than 30 days).

Well, if it's possible to order before - how are they generated? If they're in some kind of DB, can't it do a join or similar operation before exporting?
– Jon Clements, Commented May 20, 2013 at 14:02
@reptilicus even just the keys of the second file will do by the looks of it
– Jon Clements, Commented May 20, 2013 at 14:03

Fred Foo · Accepted Answer · 2013-05-20 14:33:03Z

7

[row for row in myRows if row[0] == item[0]]

is doing a linear scan over all rows for each item, so the total work grows with the product of the two file sizes. If you use a set instead, each membership test becomes an expected constant-time operation. First, read in the second file to get a set of valid ids:

with open("secondfile") as f:
    # note: only storing the ids, not the whole line
    valid_ids = set(ln.split(',', 1)[0] for ln in f)

Then you can filter the lines of the first file against the set valid_ids:

with open("firstfile") as f:
    # split only once: just the id before the first comma is needed
    matched_rows = [ln for ln in f if ln.split(',', 1)[0] in valid_ids]
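As a rough end-to-end sketch of the two steps above, the following writes the matching rows to a new file instead of collecting them in a list, which may help when the first file is very large. The function name filter_rows_by_id, the output path, and the temporary demo files are assumptions made for illustration; the sample data is taken from the question.

```python
import os
import tempfile

def filter_rows_by_id(data_path, ids_path, out_path):
    """Copy rows of data_path whose first field appears as an ID in ids_path.

    Hypothetical helper for illustration; returns the number of rows written.
    """
    with open(ids_path) as f:
        # Keep only the IDs; a set gives expected constant-time membership tests.
        valid_ids = {ln.split(',', 1)[0].strip() for ln in f if ln.strip()}

    kept = 0
    with open(data_path) as src, open(out_path, 'w') as dst:
        for ln in src:  # stream line by line instead of building a list
            if ln.split(',', 1)[0].strip() in valid_ids:
                dst.write(ln)
                kept += 1
    return kept

# Tiny demonstration using the sample data from the question.
tmp = tempfile.mkdtemp()
first = os.path.join(tmp, 'firstfile')
second = os.path.join(tmp, 'secondfile')
result = os.path.join(tmp, 'filtered')
with open(first, 'w') as f:
    f.write('1,2,3\n2,4,5\n3,3,4\n')
with open(second, 'w') as f:
    f.write('1,"fege"\n2,"greger"\n4,"feffg"\n')

kept = filter_rows_by_id(first, second, result)
with open(result) as f:
    print(f.read(), end='')
```

Only the set of IDs from the second file is held in memory here; the rows of the first file are read and written one at a time.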

edited May 20, 2013 at 14:33

answered May 20, 2013 at 14:13

Fred Foo


amit · Answer · 2013-05-20 14:18:36Z

1

I assume you are only interested in the first field. If so, you could try something like:

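def _id(s):
    # the ID is everything before the first comma
    return s.partition(',')[0]

ids = {}
with open('first-file') as f:
    for line in f:
        # note: a later row with the same ID overwrites an earlier one
        ids[_id(line)] = line
with open('second-file') as f:
    for line in f:
        k = _id(line)
        if k in ids:
            # each line already ends with a newline
            print(ids[k], end='')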
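One thing to watch with a plain dict is that it keeps only one row per ID. If the first file can contain several rows with the same ID (an assumption; the sample data does not show this), a variant that groups rows per ID might look like the sketch below. The function name rows_matching_ids and the extra sample row 2,9,9 are made up for illustration, and the whole first file is still held in memory.

```python
from collections import defaultdict

def rows_matching_ids(data_lines, id_lines):
    """Return rows from data_lines whose ID occurs in id_lines.

    Rows are grouped per ID, so repeated IDs in data_lines are all kept.
    Hypothetical helper for illustration.
    """
    rows_by_id = defaultdict(list)
    for line in data_lines:
        key = line.partition(',')[0]  # text before the first comma
        rows_by_id[key].append(line)

    matched = []
    seen = set()  # avoid emitting a group twice if an ID repeats in id_lines
    for line in id_lines:
        key = line.partition(',')[0]
        if key in rows_by_id and key not in seen:
            seen.add(key)
            matched.extend(rows_by_id[key])
    return matched

# Sample data from the question, plus one made-up duplicate row for ID 2.
first = ['1,2,3\n', '2,4,5\n', '3,3,4\n', '2,9,9\n']
second = ['1,"fege"\n', '2,"greger"\n', '4,"feffg"\n']
matched = rows_matching_ids(first, second)
print(''.join(matched), end='')
```

The output follows the order of IDs in the second file, not the original row order of the first file.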

answered May 20, 2013 at 14:18

amit
