
I have the for loop code:

people = queue.Queue()
for person in set(list_):
    first_name,last_name = re.split(',| | ',person)
    people.put([first_name,last_name])

The list being iterated has 1,000,000+ items; it works, but takes a couple of seconds to complete.

What changes can I make to help the processing speed?

Edit: I should add that this is Gevent's queue library

  • Can you post a sample line of person? Commented Dec 5, 2011 at 3:36
  • A little thing you can do is put the re outside the loop. E.g. splitter = re.compile(r',| | '), then use lastname,firstname = splitter.split(person) instead of re.split Commented Dec 5, 2011 at 4:07
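The precompile suggestion in the comment above can be sketched like this (the sample names are hypothetical, and the split order follows the question's code):

```python
import re

# Compile the pattern once, outside the loop, instead of passing
# the raw string to re.split on every iteration.
splitter = re.compile(r',| | ')

people = []
for person in {"John,Smith", "Jane,Doe"}:  # hypothetical sample data
    first_name, last_name = splitter.split(person)
    people.append([first_name, last_name])
```

re.split with a string pattern does cache the compiled regex internally, but hoisting the compile out of the loop also skips the cache lookup on each of the 1,000,000+ iterations.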

4 Answers


The question is: what is your queue being used for? If it isn't really necessary for threading purposes (or you can work around the threaded access), then in this kind of situation you want to switch to generators; you can think of them as the Python version of Unix shell pipes. So your loop would look like:

def generate_people(list_):
    previous_row = None
    for person in sorted(list_):
        if person == previous_row:
            continue
        first_name,last_name = re.split(',| | ',person)
        yield [first_name,last_name]
        previous_row = person

and you would use this generator like this:

for first_name, last_name in generate_people(list_):
    print(first_name, last_name)

This approach avoids what is probably your biggest performance hit: allocating memory to build a queue and a set with 1,000,000+ items in them. Instead, it works with one pair of strings at a time.

UPDATE

Based on more information about how threads play a role in this, I'd use this solution instead:

people = queue.Queue()
previous_row = None
for person in sorted(list_):
    if person == previous_row:
        continue
    first_name,last_name = re.split(',| | ',person)
    people.put([first_name,last_name])
    previous_row = person

This replaces the set() operation with something that should be more efficient.
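The loop above can be exercised with a small hypothetical sample to confirm that sorting plus the previous-row check deduplicates the same way set() did:

```python
import queue
import re

list_ = ["John,Smith", "Jane,Doe", "John,Smith"]  # hypothetical data with a duplicate

people = queue.Queue()
previous_row = None
for person in sorted(list_):
    if person == previous_row:
        continue  # skip consecutive duplicates, which sorting groups together
    first_name, last_name = re.split(',| | ', person)
    people.put([first_name, last_name])
    previous_row = person
```

After the loop, the queue holds two entries; the duplicate "John,Smith" was skipped.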


10 Comments

I'm using the queue for threading since it's thread safe. I'm not sure your way of splitting will work since the regex code I have in place is used to split between multiple delimiters. I will give the approach as a generator and see if that helps. Thanks.
I didn't change anything about the split. Just reworked the function into a generator.
Oh sorry, I must have read another comment with the split being changed... weird, my apologies.
Is a thread pulling from this queue as you are adding to it? If so, then the set operation may be the real performance hit here.
Nope, I add everything to the queue once then run through it.
with people.mutex:
    people.queue.extend(list(re.split(',| | ',person)) for person in set(list_))
    people.not_empty.notify_all()

Note that this completely ignores the queue capacity, but avoids lots of excessive locking.
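A variant of the same bulk-load idea using only the public API (shown here with the standard-library queue; gevent's Queue offers a similar put_nowait) still locks once per item, but avoids touching the queue's internals. The sample data is hypothetical:

```python
import queue
import re

splitter = re.compile(r',| | ')
list_ = ["John,Smith", "Jane,Doe"]  # hypothetical sample data

people = queue.Queue()  # unbounded, so put_nowait never raises queue.Full
for person in set(list_):
    people.put_nowait(splitter.split(person))
```

put_nowait skips the blocking logic of put, which is safe here because the queue has no capacity limit.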



I think you can read the data with multiple threads, using the queue as a concurrent queue.

Comments


I would try replacing regex with something a bit less intense:

first_name, last_name = person.split(', ')
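One caveat worth noting, sketched with a hypothetical "First, Last" record: str.split on the literal ", " delimiter and the question's regex do not behave the same, because the regex splits on the comma and the space separately and produces an empty middle field:

```python
import re

person = "John, Smith"  # hypothetical record format

# Plain string split on the literal ", " delimiter gives two clean fields:
first_name, last_name = person.split(', ')

# The question's regex splits on ',' OR ' ' as separate delimiters,
# so the same record yields an empty field in the middle:
parts = re.split(',| | ', person)  # ['John', '', 'Smith']
```

So str.split is both faster and, for this record shape, more correct; the regex only makes sense if the delimiter genuinely varies between records.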

Comments
