2

I am building a small search engine to search a collection of PDFs. From each PDF I extract a set of tokens and store them in a database. I do not want to store duplicate tokens; instead, I want to store the count of each token. Does Python have a data structure that does not store duplicates but keeps a count for each token?

4 Answers

5

Python >= 2.7 has collections.Counter.
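As a minimal sketch of how Counter fits the question (the token list here is made up for illustration and stands in for whatever the PDF extraction step produces):

```python
from collections import Counter

# Hypothetical tokens standing in for the output of the PDF extraction step.
tokens = ["search", "engine", "pdf", "search", "index", "pdf", "search"]

# Each distinct token is stored once, mapped to how often it occurred.
counts = Counter(tokens)

# Tokens from a further document can be folded into the same Counter.
counts.update(["engine", "query"])

# The most frequent token together with its count.
top = counts.most_common(1)
```

Looking up a token that was never seen, for example counts["missing"], returns 0 instead of raising KeyError, which is convenient when querying.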

3

I'd suggest using a simple dictionary to store the counts, like this:

storage = {}  # maps each token to its count
# ...
if token not in storage:  # first time this token is seen
    storage[token] = 1
else:
    storage[token] += 1

EDIT

That said, if you're using Python 3 I'd follow Space_C0wb0y's suggestion to use the Counter class ...

4 Comments

if not storage.has_key(token)
I'd use a collections.defaultdict and eliminate the if statement entirely.
@nikhil: Why did you accept this solution? It is quite inefficient. I think the only reason to do it this way is if you have a really old Python version.
You should test for a key using if token not in storage.

3

The collections module has defaultdict, which can be used as key-value storage with a counter:

>>> from collections import defaultdict
>>> s = 'mississippi'
>>> d = defaultdict(int)  # missing keys start at 0
>>> for k in s:
...     d[k] += 1
...
>>> sorted(d.items())
[('i', 4), ('m', 1), ('p', 2), ('s', 4)]

Just note: this is not a database, it's purely in-memory storage. You would have to save this data somehow!
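One possible way to save the counts, sketched here with the standard-library sqlite3 module. The table name token_counts, the column names, and the helper save_counts are assumptions made for this example, not something from the question:

```python
import sqlite3
from collections import Counter

def save_counts(conn, counts):
    """Add the given token counts to a (hypothetical) token_counts table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS token_counts ("
        "token TEXT PRIMARY KEY, occurrences INTEGER NOT NULL)"
    )
    for token, n in counts.items():
        # Try to bump an existing row first; insert only if the token is new.
        cur = conn.execute(
            "UPDATE token_counts SET occurrences = occurrences + ? WHERE token = ?",
            (n, token),
        )
        if cur.rowcount == 0:
            conn.execute(
                "INSERT INTO token_counts (token, occurrences) VALUES (?, ?)",
                (token, n),
            )
    conn.commit()

# An in-memory database keeps the example self-contained; a file path would persist it.
conn = sqlite3.connect(":memory:")
save_counts(conn, Counter("mississippi"))
save_counts(conn, Counter("miss"))
rows = dict(conn.execute("SELECT token, occurrences FROM token_counts"))
```

The primary key on token means the database itself never holds a duplicate token; repeated saves only increase the stored count.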

0

You could always implement an object for every file and give it a number of methods, such as open and display. You could then define __hash__ and __eq__ on the object, which would let you store the items in a set so that duplicates collapse into a single instance inside the set.

This is just another way of doing it; by no means is it the best method.
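A rough sketch of that idea, using a hypothetical Token class rather than a per-file object to keep it short. Equality and hashing are based on the text alone, so the set keeps one instance per distinct token:

```python
class Token:
    """Hypothetical wrapper; two tokens are equal when their text is equal."""

    def __init__(self, text):
        self.text = text

    def __eq__(self, other):
        return isinstance(other, Token) and self.text == other.text

    def __hash__(self):
        # Must agree with __eq__: equal objects need equal hashes.
        return hash(self.text)

seen = set()
for word in ["pdf", "search", "pdf"]:
    seen.add(Token(word))  # the second "pdf" is treated as already present

unique_texts = sorted(t.text for t in seen)
```

Note that a set on its own only records which items exist; to keep counts as well, the object would still need a counter attribute or a separate mapping.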
