Memory efficient groupby in Python

Question

I have a very large file sorted on a field. I'd like to read this data and group together the lines that contain the same value in that field. For example:

I have a file with two fields:

12    fish
50    fish
1     turtle
11    dog
34    dog
12    dog

I'm looking for a solution that uses an iterator or a generator. It's not possible for me to read all the data into memory, only one group (inner list) at a time. I was trying to use groupby, but couldn't figure out how to group based on the same value in a field.

How can I produce lists like this:

[[12, fish], [50, fish]]
[[1, turtle]]
[[11, dog], [34, dog], [12, dog]]

Jon Clements · Accepted Answer · 2013-02-06 16:28:07Z


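from itertools import groupby
from operator import itemgetter

with open('somefile') as fin:
    # Generator expression: each line is split lazily as it is read.
    lines = (line.split() for line in fin)
    # groupby only merges consecutive rows, so the file must already be
    # sorted (or at least grouped) on the second field.
    for key, items in groupby(lines, itemgetter(1)):
        print(list(items))  # only the current group is held in memory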

[['12', 'fish'], ['50', 'fish']]
[['1', 'turtle']]
[['11', 'dog'], ['34', 'dog'], ['12', 'dog']]
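
The output above keeps the first field as a string, whereas the question asks for integers. Here is a minimal sketch of one way to get there, wrapping the same groupby idea in a generator. The name iter_groups, its field parameter, and the inline sample data are illustrative assumptions, and the sketch assumes the first column always parses as an integer. Because groupby only merges consecutive rows, it relies on the input already being sorted on the grouping field, as the question states.

```python
from itertools import groupby
from operator import itemgetter


def iter_groups(lines, field=1):
    """Yield one list of rows per run of consecutive lines sharing `field`.

    `iter_groups` and `field` are hypothetical names used for illustration.
    """
    # Split lazily and skip blank lines; nothing is read ahead of need.
    rows = (line.split() for line in lines if line.strip())
    for _key, items in groupby(rows, key=itemgetter(field)):
        # Only the current group is materialized; the first column is
        # converted to int (assumed to be numeric).
        yield [[int(row[0])] + row[1:] for row in items]


# Sample data standing in for the file from the question.
sample = [
    "12    fish\n",
    "50    fish\n",
    "1     turtle\n",
    "11    dog\n",
    "34    dog\n",
    "12    dog\n",
]

groups = list(iter_groups(sample))
for group in groups:
    print(group)
```

With a real file, passing the open file object (for example iter_groups(fin) inside the with block) should stream it the same way, one group at a time.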
