
I am very new to Python. In a script I need to check whether an input string is present in the set `titles`, which I load from a file named 'titles' containing newline-separated strings. This consumes a huge amount of memory. I chose to store the data in a set because I later do `if inputstring in titles:`.

Line #    Mem usage    Increment   Line Contents
================================================
     1    6.160 MiB    0.000 MiB   @profile
     2                             def loadtitles():
     3  515.387 MiB  509.227 MiB     titles = open('titles').read().split()
     4  602.555 MiB   87.168 MiB     titles = set(titles)

Q1. Is there a more memory-efficient object type for storing this large data?

One solution I came up with: if I load the file as a single string, it consumes exactly as much memory as the file size, which is 100% optimal memory consumption.

Line #    Mem usage    Increment   Line Contents
================================================
     1    6.160 MiB    0.000 MiB   @profile
     2                             def loadtitles():
     3  217.363 MiB  211.203 MiB     titles = open('titles').read()

Then I can do `if inputstring + '\n' in titles:`.
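A minimal sketch of this substring approach (with hypothetical helper names): note that a bare `inputstring + '\n'` check can also match the tail of a longer title (e.g. `'bar\n'` inside `'foobar\n'`), so delimiting with a newline on both sides is safer:

```python
def load_titles_text(path='titles'):
    # Read the whole file as one string; memory use is roughly the file size.
    with open(path) as f:
        # Pad with newlines so every title, including the first and last,
        # is delimited on both sides.
        return '\n' + f.read().rstrip('\n') + '\n'

def has_title(titles_text, inputstring):
    # Substring search; the surrounding newlines prevent partial matches
    # such as 'bar' matching inside 'foobar'.
    return '\n' + inputstring + '\n' in titles_text
```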

Q2. Is there a faster alternative to this?

2 Answers


You can either:

  • use a key/value store if you lookup lots of keys.
  • iterate over the file line by line and check for keys' existence if there are only a few keys to lookup.
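The second option can be sketched as a streaming scan (a hypothetical helper, not from the answer): peak memory is a single line rather than the whole file, at the cost of O(n) time per lookup:

```python
def title_exists(inputstring, path='titles'):
    # Stream the file line by line; only one line is held in memory at a time.
    with open(path) as f:
        return any(line.rstrip('\n') == inputstring for line in f)
```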

4 Comments

Using dict or shelve saves about 128 MiB of memory. Any improvements you can suggest on that?
@Nexu, what memory threshold would satisfy you? You can't endlessly keep lowering memory footprint as you'll never reach satisfactory results.
dict will do for me (@ 500M memory). I was simply asking if that code could still be made more efficient somehow with any alternative dictionary-like objects.
@Nexu, you could presort the input file and create a table with offsets for each starting char. You could use an LRU cache for the dict object. You could use memcached/redis/sqlite3/postgres or any other db engine.
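One illustrative reading of the presorting suggestion (a sketch under assumptions, not the commenter's exact scheme): sort the titles once and binary-search them with the standard `bisect` module; a sorted list avoids a set's hash-table overhead while keeping lookups at O(log n):

```python
import bisect

def load_sorted_titles(path='titles'):
    # Load all titles, then presort once up front.
    with open(path) as f:
        titles = f.read().split()
    titles.sort()
    return titles

def lookup(titles, inputstring):
    # Binary search in the sorted list.
    i = bisect.bisect_left(titles, inputstring)
    return i < len(titles) and titles[i] == inputstring
```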

Iterating over the file (processing it line by line) instead of reading its full contents will reduce memory consumption (combined with a set comprehension):

def loadtitles():
    with open('titles') as f:
        titles = {word for line in f for word in line.split()}

8 Comments

@Nexu, `titles = set(titles)` is not necessary. `titles` is already a set object. (`{ ... for ... in ... }` is a set comprehension.)
Oh yes. Screenshot2. Saves only 1 MiB though. (python 2.7.3)
Somehow, the one you suggested was way too slow. I think I'll stick with string in string comparison for now.
@Nexu, does `if inputstring in titles` run multiple times or just once?
Twice, with slightly different strings checked. It executes on user input. How will that affect it?
