
I have a 2 GB text file of values that a Python program I run infrequently uses for filtering. I do this by loading every line into a set and doing fast membership checks. This was fine when the file was only a few megabytes, but after a year the file has grown and the initial loading time has become unmanageable, even though I have essentially unlimited RAM.

Before I replace my existing code with a file-based binary search, I wanted to ask whether there is any way to use set functionality directly against a file on disk. I know there are tools to serialize data structures and load them back into memory, but the loading step is exactly the problem here.
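A minimal sketch of the current pattern, assuming one value per line (the file name is illustrative):

# Current approach: load the whole filter file into a set once,
# then do O(1) membership checks against it.
with open('values.txt', encoding='utf-8') as f:
    allowed = {line.rstrip('\n') for line in f}  # the slow part: reads the whole file up front

def is_allowed(value):
    return value in allowed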

Comments:

  • What you are asking for is a database; sqlite3 would work, and it comes with Python out of the box, so no external dependencies. Commented Nov 9, 2019 at 2:48
  • @juanpa.arrivillaga This is probably what I'll go with to offload the binary search to the DB; not sure why I didn't think of it, given how obvious it is having used SQLite quite a lot. Commented Nov 9, 2019 at 3:13
  • Yeah, don't reinvent the wheel. Just make sure your table is indexed on your search column. Commented Nov 9, 2019 at 3:15
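A minimal sketch of the sqlite3 approach suggested in the comments; the database file, table, and column names are illustrative:

import sqlite3

# One-time setup: load the text file into an indexed SQLite table.
con = sqlite3.connect('values.db')
con.execute('CREATE TABLE IF NOT EXISTS filter_values (value TEXT PRIMARY KEY)')
with open('values.txt', encoding='utf-8') as f:
    con.executemany(
        'INSERT OR IGNORE INTO filter_values (value) VALUES (?)',
        ((line.rstrip('\n'),) for line in f),
    )
con.commit()

# Later runs: membership check without loading anything into memory;
# the PRIMARY KEY gives an index, so each lookup is an indexed search.
def is_allowed(value):
    cur = con.execute('SELECT 1 FROM filter_values WHERE value = ? LIMIT 1', (value,))
    return cur.fetchone() is not None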

2 Answers


Your best bet is to store it in a database. MongoDB handles this well, and then you can just query the database the way you would a set.

You have to install the server and the Python driver first:

sudo apt install mongodb-server-core
pip3 install pymongo

Then create a /data/db directory on your drive with the right permissions and start the server with

mongod &

before the code below will work:

from pymongo import MongoClient

# Connect to the local MongoDB server (default host and port shown explicitly).
client = MongoClient('localhost', 27017)
# Equivalent URI form:
# client = MongoClient('mongodb://localhost:27017')

db = client.pymongo_test
posts = db.posts

# Insert one document, then look it up again.
post_data = {
    'title': 'Python and MongoDB',
    'content': 'PyMongo is fun, you guys',
    'author': 'Bill'
}
result = posts.insert_one(post_data)
print('One post: {0}'.format(result.inserted_id))

bills_post = posts.find_one({'author': 'Bill'})
print(bills_post)

Output:

One post: 5dc61c0cc2b75ebc458da31f
{'_id': ObjectId('5dc61bf76071bde943ca262b'), 'title': 'Python and MongoDB', 'content': 'PyMongo is fun, you guys', 'author': 'Bill'}
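To mirror the asker's set-membership check with this setup, a lookup might look like the following; the collection and field names are assumptions for illustration, not part of the answer above:

# One document per filter value, with a unique index so lookups don't scan the collection.
values = db.filter_values
values.create_index('value', unique=True)

def is_allowed(value):
    # find_one returns None when nothing matches, so this behaves like `value in my_set`.
    return values.find_one({'value': value}) is not None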

How about this as an interim approach until you find a DB or indexed-file solution?

  1. Divide the keyword file into multiple files (37 of them: 26 letters, 10 digits, plus one catch-all), based on the character each keyword starts with, e.g.

    • keys_startwith_a.txt: contains all values starting with 'A' or 'a'

      ...

    • keys_startwith_z.txt: contains all values starting with 'Z' or 'z'
    • keys_startwith_0.txt: contains all values starting with '0'

      ...

    • keys_startwith_9.txt: contains all values starting with '9'
    • keys_startwith_others.txt: contains all values starting with any other character
  2. Change the read mechanism for each file to a stream, i.e. iterate over the open file line by line instead of loading it whole, e.g.

io.open(file, buffering=1)

Now when you want to check a value, you just look at which character the key starts with and compare against the values in the corresponding file.
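A minimal sketch of this partition-and-stream lookup, following the naming scheme in the list above (directory layout and helper names are illustrative):

import os
import string

def partition_path(value, directory='keys'):
    # Map a value to its partition file based on its first character.
    first = value[:1].lower()
    if first and (first in string.ascii_lowercase or first in string.digits):
        name = 'keys_startwith_{}.txt'.format(first)
    else:
        name = 'keys_startwith_others.txt'
    return os.path.join(directory, name)

def is_allowed(value, directory='keys'):
    # Stream through only the relevant partition instead of loading everything.
    path = partition_path(value, directory)
    if not os.path.exists(path):
        return False
    with open(path, encoding='utf-8') as f:
        return any(line.rstrip('\n') == value for line in f)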

With this:

  • Your file loading is faster (each file's footprint is significantly smaller) and on demand (streamed), i.e. only when required.
  • Comparison is faster, since each check runs against far fewer values.
  • Since you have plenty of memory, even if all the files end up loaded during comparison, it won't cause issues.
  • Finally, you could use threads for even faster loading of the files and hand the comparison work to multiple threads (if required).

