
I am working on a project where I read a list of as many as 250,000 items (or more) and convert each of its entries into a key in a hash table.

sample_key = open("sample_file.txt").readlines()
sample_counter = [0] * (len(sample_key))
sample_hash = {sample.replace('\n', ''):counter for sample, counter in zip(sample_key, sample_counter)}

This code works well when len(sample_key) is in the range 1000-2000. Beyond that, it simply seems to ignore any further data.

Any suggestions on how I can handle this large amount of list data?

PS: Also, if there is a more optimal way to perform this task (like reading each line directly as a hash key entry), please suggest it. I'm new to Python.

  • There's no code reason why that shouldn't work for longer lengths. Perhaps your program is running out of memory, if the items in question aren't small? Commented Feb 14, 2016 at 17:54
  • As far as I know, Python dictionaries work as hash tables. Commented Feb 14, 2016 at 17:55
  • "Beyond that it simply ignores processing any further data." How? Your computer saying "Nope, not gonna do it"? Do you get an exception, are no more values added to the dict, or can values put into the dict not be retrieved, or is it slower than expected? Commented Feb 14, 2016 at 17:55
  • @tobias_k I see that in the debugger! No exception, no warning or error. So I said it simply ignores. :) Commented Feb 14, 2016 at 17:56
  • My guess is that some of the lines are the same, and your dict comprehension overwrites previously inserted keys. You are aware that keys in a dictionary are unique, right? Also, not quite sure what you are trying to achieve, but I think you might be interested in collections.Counter (see the sketch after these comments). Commented Feb 14, 2016 at 17:58
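
A minimal sketch of the collections.Counter idea from the last comment, assuming the goal is to count how often each line occurs (sample_file.txt is the file name from the question):

from collections import Counter

# Count occurrences of each stripped, non-empty line in the file.
with open("sample_file.txt") as f:
    line_counts = Counter(line.strip() for line in f if line.strip())

# line_counts behaves like a dict mapping each unique line to its count.
print(line_counts.most_common(5))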

1 Answer


Your text file can contain duplicate lines, which will overwrite existing keys in your dictionary (the Python equivalent of a hash table). You can build a set of the unique keys first, and then use a dictionary comprehension to populate the dictionary.

sample_file.txt

a
b
c
c

Python code

# Strip newlines and keep only the unique lines as keys.
with open("sample_file.txt") as f:
    keys = set(line.strip() for line in f)
my_dict = {key: 1 for key in keys if key}
>>> my_dict
{'a': 1, 'b': 1, 'c': 1}

Here is a demonstration with one million random alphabetic strings of length 10. The timing is relatively trivial, at under half a second.

import string
import numpy as np

# Map 1..26 to the letters 'a'..'z'.
letter_map = {n: letter for n, letter in enumerate(string.ascii_lowercase, 1)}
long_alpha_list = ["".join([letter_map[number] for number in row]) + "\n"
                   for row in np.random.randint(1, 27, (1000000, 10))]
>>> long_alpha_list[:5]
['mfeeidurfc\n',
 'njbfzpunzi\n',
 'yrazcjnegf\n',
 'wpuxpaqhhs\n',
 'fpncybprrn\n']

>>> len(long_alpha_list)
1000000

# Write list to file.
with open('sample_file.txt', 'w') as f:
    f.writelines(long_alpha_list)

# Read them back into a dictionary per the method above.
with open("sample_file.txt") as f:
    keys = set(line.strip() for line in f)

>>> %%timeit -n 10
>>> my_dict = {key: 1 for key in keys if key}

10 loops, best of 3: 379 ms per loop
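
As a side note on the question's PS about reading lines directly as hash key entries: the following is only a sketch, not part of the original answer, but you can skip the intermediate set and build the dictionary straight from the file object with dict.fromkeys, using 0 to match the counter value from the question:

# Build the dictionary directly from the file, one key per unique non-blank line.
with open("sample_file.txt") as f:
    sample_hash = dict.fromkeys((line.strip() for line in f if line.strip()), 0)

# Duplicate lines simply map to the same key, so no data is actually lost.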