
I am working on a project where I read a list of as many as 250,000 items (or more) and convert each of its entries into a key in a hash table.

sample_key = open("sample_file.txt").readlines()
sample_counter = [0] * (len(sample_key))
sample_hash = {sample.replace('\n', ''):counter for sample, counter in zip(sample_key, sample_counter)}

This code works well when len(sample_key) is in the range 1000-2000. Beyond that, it simply seems to ignore any further data.

Any suggestions on how I can handle this large amount of list data?

PS: Also, if there is a more optimal way to perform this task (like reading each line directly as a hash key entry), please suggest it. I'm new to Python.

  • There's no code reason why that shouldn't work for longer lengths. Perhaps your program is running out of memory, if the items in question aren't small? Commented Feb 14, 2016 at 17:54
  • As far as I know, Python dictionaries work as hash tables. Commented Feb 14, 2016 at 17:55
  • "Beyond that it simply ignores processing any further data." How? Your computer saying "Nope, not gonna do it"? Do you get an exception, are no more values added to the dict, or can values put into the dict not be retrieved, or is it slower than expected? Commented Feb 14, 2016 at 17:55
  • @tobias_k I see that in the debugger! No exception, no warning or error. So I said it simply ignores. :) Commented Feb 14, 2016 at 17:56
  • My guess is that some of the lines are the same, and your dict comprehension overwrites previously inserted keys. You are aware that keys in a dictionary are unique, right? Also, not quite sure what you are trying to achieve, but I think you might be interested in collections.Counter (see the sketch after these comments). Commented Feb 14, 2016 at 17:58
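
A minimal sketch of the collections.Counter idea from the last comment, assuming the goal is to count how often each line occurs (sample_file.txt is the file name from the question):

from collections import Counter

# Count occurrences of each stripped, non-empty line in the file.
with open("sample_file.txt") as f:
    line_counts = Counter(line.strip() for line in f if line.strip())

# line_counts behaves like a dict mapping each unique line to its count.
print(line_counts.most_common(5))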

1 Answer


Your text file can contain duplicate lines, which will overwrite existing keys in your dictionary (the Python equivalent of a hash table). You can build a set of the unique keys first, and then use a dictionary comprehension to populate the dictionary.

sample_file.txt

a
b
c
c

Python code

# Strip newlines and keep only the unique lines as keys.
with open("sample_file.txt") as f:
    keys = set(line.strip() for line in f)
my_dict = {key: 1 for key in keys if key}
>>> my_dict
{'a': 1, 'b': 1, 'c': 1}

Here is a demonstration with one million random alphabetic strings of length 10. The timing is relatively trivial, at under half a second.

import string
import numpy as np

# Map 1..26 to the letters 'a'..'z'.
letter_map = {n: letter for n, letter in enumerate(string.ascii_lowercase, 1)}
long_alpha_list = ["".join([letter_map[number] for number in row]) + "\n"
                   for row in np.random.randint(1, 27, (1000000, 10))]
>>> long_alpha_list[:5]
['mfeeidurfc\n',
 'njbfzpunzi\n',
 'yrazcjnegf\n',
 'wpuxpaqhhs\n',
 'fpncybprrn\n']

>>> len(long_alpha_list)
1000000

# Write list to file.
with open('sample_file.txt', 'w') as f:
    f.writelines(long_alpha_list)

# Read them back into a dictionary per the method above.
with open("sample_file.txt") as f:
    keys = set(line.strip() for line in f)

>>> %%timeit -n 10
>>> my_dict = {key: 1 for key in keys if key}

10 loops, best of 3: 379 ms per loop
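
As a side note on the question's PS about reading lines directly as hash key entries: the following is only a sketch, not part of the original answer, but you can skip the intermediate set and build the dictionary straight from the file object with dict.fromkeys, using 0 to match the counter value from the question:

# Build the dictionary directly from the file, one key per unique non-blank line.
with open("sample_file.txt") as f:
    sample_hash = dict.fromkeys((line.strip() for line in f if line.strip()), 0)

# Duplicate lines simply map to the same key, so no data is actually lost.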