
I am very new to Python. In a script I need to check whether an input string is present in the set `titles`, which I load from a file named 'titles' containing newline-separated strings. This consumes a huge amount of memory. I chose to store the data in a set because I later do `if inputstring in titles:`.

Line #    Mem usage    Increment   Line Contents
================================================
     1    6.160 MiB    0.000 MiB   @profile
     2                             def loadtitles():
     3  515.387 MiB  509.227 MiB     titles = open('titles').read().split()
     4  602.555 MiB   87.168 MiB     titles = set(titles)

Q1. Is there a more memory-efficient object type for storing this large data?

One solution I came up with: if I load the file as a single string, it consumes exactly as much memory as the file size, which is 100% optimal memory consumption.

Line #    Mem usage    Increment   Line Contents
================================================
     1    6.160 MiB    0.000 MiB   @profile
     2                             def loadtitles():
     3  217.363 MiB  211.203 MiB     titles = open('titles').read()

Then I can do `if inputstring + '\n' in titles:`.
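A minimal sketch of this substring approach (with hypothetical helper names): note that a bare `inputstring + '\n'` check can also match the tail of a longer title (e.g. `'bar\n'` inside `'foobar\n'`), so delimiting with a newline on both sides is safer:

```python
def load_titles_text(path='titles'):
    # Read the whole file as one string; memory use is roughly the file size.
    with open(path) as f:
        # Pad with newlines so every title, including the first and last,
        # is delimited on both sides.
        return '\n' + f.read().rstrip('\n') + '\n'

def has_title(titles_text, inputstring):
    # Substring search; the surrounding newlines prevent partial matches
    # such as 'bar' matching inside 'foobar'.
    return '\n' + inputstring + '\n' in titles_text
```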

Q2. Is there a faster alternative to this?

2 Answers


You can either:

  • use a key/value store if you lookup lots of keys.
  • iterate over the file line by line and check for keys' existence if there are only a few keys to lookup.
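The second option can be sketched as a streaming scan (a hypothetical helper, not from the answer): peak memory is a single line rather than the whole file, at the cost of O(n) time per lookup:

```python
def title_exists(inputstring, path='titles'):
    # Stream the file line by line; only one line is held in memory at a time.
    with open(path) as f:
        return any(line.rstrip('\n') == inputstring for line in f)
```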

4 Comments

Using dict or shelve saves about 128 MiB of memory. Any improvements you can suggest on that?
@Nexu, what memory threshold would satisfy you? You can't endlessly keep lowering memory footprint as you'll never reach satisfactory results.
dict will do for me (@ 500M memory). I was simply asking if that code could still be made more efficient somehow with any alternative dictionary-like objects.
@Nexu, you could presort the input file and create a table with offsets for each starting char. You could use an LRU cache for the dict object. You could use memcached/redis/sqlite3/postgres or any other db engine.
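One illustrative reading of the presorting suggestion (a sketch under assumptions, not the commenter's exact scheme): sort the titles once and binary-search them with the standard `bisect` module; a sorted list avoids a set's hash-table overhead while keeping lookups at O(log n):

```python
import bisect

def load_sorted_titles(path='titles'):
    # Load all titles, then presort once up front.
    with open(path) as f:
        titles = f.read().split()
    titles.sort()
    return titles

def lookup(titles, inputstring):
    # Binary search in the sorted list.
    i = bisect.bisect_left(titles, inputstring)
    return i < len(titles) and titles[i] == inputstring
```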

Iterating over the file (processing it line by line) instead of reading its full contents will reduce memory consumption (combined with a set comprehension):

def loadtitles():
    with open('titles') as f:
        titles = {word for line in f for word in line.split()}

8 Comments

@Nexu, `titles = set(titles)` is not necessary. `titles` is already a set object. (`{ ... for ... in ... }` is a set comprehension.)
Oh yes. Screenshot2. Saves only 1 MiB though. (python 2.7.3)
Somehow, the one you suggested was way too slow. I think I'll stick with string in string comparison for now.
@Nexu, does `if inputstring in titles` run multiple times or just once?
Twice, with slightly different strings checked. It executes on user input. How will that affect it?
