3

I have a problem with two very large files (each with more than 1,000,000 entries) in Python: I need to filter one against the other and I don't know how. I have two files like this:

1,2,3
2,4,5
3,3,4

and the second

1,"fege"
2,"greger"
4,"feffg"

The first item of each row is always the ID. I want to filter the first list so that it only contains rows whose IDs also appear in the second file. For this example the result should be:

1,2,3
2,4,5

How can I do this quickly? The core problem is that each list is very, very long. I used something like this:

[row for row in myRows if row[0] == item[0]]

but this takes a very long time to run through (more than 30 days).

9
  • No, they are not ordered Commented May 20, 2013 at 14:01
  • But it would be possible to order it before. Commented May 20, 2013 at 14:01
  • Well, if it's possible to order before - how are they generated? If they're in some kind of DB, can't it do a join or similar operation before exporting? Commented May 20, 2013 at 14:02
  • Can you read both files into memory? Commented May 20, 2013 at 14:03
  • @reptilicus even just the keys of the second file will do by the looks of it Commented May 20, 2013 at 14:03

2 Answers

7
[row for row in myRows if row[0] == item[0]]

is doing a linear scan over all rows for each item. If you use a set instead, you can bring each lookup down to an expected constant-time operation. First, read in the second file to get a set of valid ids:

with open("secondfile") as f:
    # note: only storing the ids, not the whole line
    valid_ids = set(ln.split(',', 1)[0] for ln in f)

Then you can filter the lines of the first file using the set valid_ids:

with open("firstfile") as f:
    matched_rows = [ln for ln in f if ln.split(',')[0] in valid_ids]
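
To see the two steps end to end, here is a minimal sketch that runs on the sample data from the question. The helper names `load_valid_ids` and `filter_by_id` are made up for illustration, and the in-memory lists stand in for the two open file objects. It assumes the ID is always the text before the first comma.

```python
def load_valid_ids(lines):
    """Collect the first comma-separated field of every non-empty line."""
    return {ln.split(',', 1)[0].strip() for ln in lines if ln.strip()}


def filter_by_id(lines, valid_ids):
    """Lazily yield only the lines whose first field is in valid_ids."""
    for ln in lines:
        if ln.split(',', 1)[0].strip() in valid_ids:
            yield ln


# Sample data from the question; real code would pass open file objects.
first = ["1,2,3\n", "2,4,5\n", "3,3,4\n"]
second = ['1,"fege"\n', '2,"greger"\n', '4,"feffg"\n']

valid_ids = load_valid_ids(second)
result = list(filter_by_id(first, valid_ids))
```

Because `filter_by_id` is a generator, the matching lines could also be written straight to an output file instead of being collected in a list, so only the set of ids has to fit in memory.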

1

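I assume you are only interested in the first field. If so, you could try something like:

def _id(s):
    # everything before the first comma is the ID
    return s[:s.index(',')]

ids = {}
with open('first-file') as f:
    for line in f:
        ids[_id(line)] = line  # a repeated ID keeps only its last line
with open('second-file') as f:
    for line in f:
        k = _id(line)
        if k in ids:
            print(ids[k], end='')  # the line already ends with a newline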
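
One caveat with the dictionary approach is that a plain dict keeps only one line per ID. If the first file can contain the same ID more than once, a variation along these lines may help. It is only a sketch: `first_field`, `index_by_id` and `matching_rows` are hypothetical names, and the sample lists (with one extra duplicate row added) stand in for the real files.

```python
from collections import defaultdict


def first_field(line):
    """Return the text before the first comma (the ID)."""
    return line.split(',', 1)[0]


def index_by_id(lines):
    """Map each ID to the list of all lines that carry it."""
    rows = defaultdict(list)
    for line in lines:
        rows[first_field(line)].append(line)
    return rows


def matching_rows(first_lines, second_lines):
    """Return every first-file line whose ID appears in the second file."""
    rows = index_by_id(first_lines)
    matched = []
    seen = set()  # avoid emitting rows twice if the second file repeats an ID
    for line in second_lines:
        key = first_field(line)
        if key in rows and key not in seen:
            seen.add(key)
            matched.extend(rows[key])
    return matched


# Sample data from the question, plus one duplicate ID in the first file.
first = ["1,2,3\n", "2,4,5\n", "3,3,4\n", "1,9,9\n"]
second = ['1,"fege"\n', '2,"greger"\n', '4,"feffg"\n']
matched = matching_rows(first, second)
```

Note that the output follows the order of IDs in the second file, not the original order of the first file, and that the whole first file is held in memory.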
