i am building a small search engine to search a collection of pdfs. From each pdf i extract a set of tokens and store it in database. I do not want to store duplicate tokens in database, instead i want to store count of each token in the database. Does python has any special datastructure that do not store duplicates but stores the counts of each token?
4 Answers
I'd suggest to use a simple dictionary to store the count like
storage = {} # initialize
# ...
if !storage.has_key(token):
storage[token] = 1
else:
storage[token] += 1
EDIT
That said, if you're using Python 3 I'd follow Space_C0wb0y's suggestion to use the Counter class ...
4 Comments
collections.defaultdict and eliminate the if statement entirely.if token not in storage.The collections package has defaultdict which can be used as a key-value storage with a counter:
>>> s = 'mississippi'
>>> d = defaultdict(int)
>>> for k in s:
... d[k] += 1
...
>>> d.items()
[('i', 4), ('p', 2), ('s', 4), ('m', 1)]
Just so notice: This is not a databse, it's a pure in memory storage. You would have to save this data somehow!
Comments
You could always implement an object for every file, giving it a number of methods, like open and display and etc etc. You could then define __hash__ and __eq__ for the object, this would allow you to store items in a set, causing the duplicates to just update a single instance inside the set.
This is just another way of doing something by no means is it the best method.