
I have a 2 GB text file of values that a Python program I run infrequently uses for filtering. I do this by loading every line into a set and doing fast membership checks. This was fine when the file was only a few megabytes, but after a year the file has grown and the initial loading time has become unmanageable, even though I have essentially unlimited RAM.

Before I replace my existing code with a file-based binary search, I wanted to ask whether there is any way to use set functionality directly against a file on disk. I know there are tools to serialize data structures and load them back into memory, but the loading step is exactly the problem here.
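A minimal sketch of the current pattern, assuming one value per line (the file name is illustrative):

# Current approach: load the whole filter file into a set once,
# then do O(1) membership checks against it.
with open('values.txt', encoding='utf-8') as f:
    allowed = {line.rstrip('\n') for line in f}  # the slow part: reads the whole file up front

def is_allowed(value):
    return value in allowed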

Comments:

  • What you are asking for is a database; sqlite3 would work, and it comes with Python out of the box, so no external dependencies. Commented Nov 9, 2019 at 2:48
  • @juanpa.arrivillaga This is probably what I'll go with to offload the binary search to the DB; not sure why I didn't think of it, given how obvious it is having used SQLite quite a lot. Commented Nov 9, 2019 at 3:13
  • Yeah, don't reinvent the wheel. Just make sure your table is indexed on your search column. Commented Nov 9, 2019 at 3:15
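A minimal sketch of the sqlite3 approach suggested in the comments; the database file, table, and column names are illustrative:

import sqlite3

# One-time setup: load the text file into an indexed SQLite table.
con = sqlite3.connect('values.db')
con.execute('CREATE TABLE IF NOT EXISTS filter_values (value TEXT PRIMARY KEY)')
with open('values.txt', encoding='utf-8') as f:
    con.executemany(
        'INSERT OR IGNORE INTO filter_values (value) VALUES (?)',
        ((line.rstrip('\n'),) for line in f),
    )
con.commit()

# Later runs: membership check without loading anything into memory;
# the PRIMARY KEY gives an index, so each lookup is an indexed search.
def is_allowed(value):
    cur = con.execute('SELECT 1 FROM filter_values WHERE value = ? LIMIT 1', (value,))
    return cur.fetchone() is not None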

2 Answers


Your best bet is to store it in a database. MongoDB handles this well, and then you can just query the database the way you would a set.

You have to install the server and the Python driver first:

sudo apt install mongodb-server-core
pip3 install pymongo

Then create a /data/db directory on your drive with the right permissions and start the server with

mongod &

before the code below will work:

from pymongo import MongoClient

# Connect to the local MongoDB server (default host and port shown explicitly).
client = MongoClient('localhost', 27017)
# Equivalent URI form:
# client = MongoClient('mongodb://localhost:27017')

db = client.pymongo_test
posts = db.posts

# Insert one document, then look it up again.
post_data = {
    'title': 'Python and MongoDB',
    'content': 'PyMongo is fun, you guys',
    'author': 'Bill'
}
result = posts.insert_one(post_data)
print('One post: {0}'.format(result.inserted_id))

bills_post = posts.find_one({'author': 'Bill'})
print(bills_post)

Output:

One post: 5dc61c0cc2b75ebc458da31f
{'_id': ObjectId('5dc61bf76071bde943ca262b'), 'title': 'Python and MongoDB', 'content': 'PyMongo is fun, you guys', 'author': 'Bill'}
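To mirror the asker's set-membership check with this setup, a lookup might look like the following; the collection and field names are assumptions for illustration, not part of the answer above:

# One document per filter value, with a unique index so lookups don't scan the collection.
values = db.filter_values
values.create_index('value', unique=True)

def is_allowed(value):
    # find_one returns None when nothing matches, so this behaves like `value in my_set`.
    return values.find_one({'value': value}) is not None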

How about this as an interim approach until you find a DB or indexed-file solution?

  1. Divide the keyword file into multiple files (37 of them: 26 letters, 10 digits, plus one catch-all), based on the character each keyword starts with, e.g.

    • keys_startwith_a.txt: contains all values starting with 'A' or 'a'

      ...

    • keys_startwith_z.txt: contains all values starting with 'Z' or 'z'
    • keys_startwith_0.txt: contains all values starting with '0'

      ...

    • keys_startwith_9.txt: contains all values starting with '9'
    • keys_startwith_others.txt: contains all values starting with any other character
  2. Change the read mechanism for each file to a stream, i.e. iterate over the open file line by line instead of loading it whole, e.g.

io.open(file, buffering=1)

Now when you want to check a value, you just look at which character the key starts with and compare against the values in the corresponding file.
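A minimal sketch of this partition-and-stream lookup, following the naming scheme in the list above (directory layout and helper names are illustrative):

import os
import string

def partition_path(value, directory='keys'):
    # Map a value to its partition file based on its first character.
    first = value[:1].lower()
    if first and (first in string.ascii_lowercase or first in string.digits):
        name = 'keys_startwith_{}.txt'.format(first)
    else:
        name = 'keys_startwith_others.txt'
    return os.path.join(directory, name)

def is_allowed(value, directory='keys'):
    # Stream through only the relevant partition instead of loading everything.
    path = partition_path(value, directory)
    if not os.path.exists(path):
        return False
    with open(path, encoding='utf-8') as f:
        return any(line.rstrip('\n') == value for line in f)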

With this:

  • Your file loading is faster (each file's footprint is significantly smaller) and on demand (streamed), i.e. only when required.
  • Comparison is faster, since each check runs against far fewer values.
  • Since you have plenty of memory, even if all the files end up loaded during comparison, it won't cause issues.
  • Finally, you could use threads for even faster loading of the files and hand the comparison work to multiple threads (if required).

