I'm looking for some help understanding the performance characteristics of large lists, dicts or arrays in Python. I have about 1M key value pairs that I need to store temporarily (this will grow to maybe 10M over the next year). They keys are database IDs ranging from 0 to about 1.1M (with some gaps) and the values are floats.
I'm calculating pagerank, so my process is to initialize each ID with a value of 1, then look it up in memory and update it about ten times before saving it back to the database.
I'm theorizing that lists or arrays will be fastest if I use the database ID as the index of the array/list. This will create a gappy data structure, but I don't understand how fast look ups or updates will be. I also don't yet understand if there's a big gain to get from using
arraysinstead of lists.Using a dict for this is very natural, with key-value pairs, but I get the impression building the dict the first time would be very slow and memory intensive as it grows to accommodate all the entries.
I also read that SQLite might be a good solution for this using the
:memory:flag, but I haven't dug into that too much yet.
Anyway, just looking for some guidance here. Any thoughts would be much appreciated as I'm digging in.
numpyif you want to manipulate an array of million of entries. At the very least it will use about 2-3 times less memory, and if you are lucky you may vectorize operations giving huge speed gains. However you should be able to handle 10 million key/value pairs in adictwith a normal computer. Creating a bigdictsure takes time, but it's a linear-time operation so it wont take a huge amount of time. On my machine it takes 1.88 seconds to build a 10-million entrydict. and it uses about 700MB of RAM.numpyon a element-by-element way then you may see a slow down of the application, becausenumpyhas to perform the conversions between native data-types to python objects every time, so when you doarray[index]a new object is created etc. This is true also for the arrays in thearraymodule, so you should be very careful and do some real profiling on your application before deciding.