Python Datastructure Suggestion

Question

i am building a small search engine to search a collection of pdfs. From each pdf i extract a set of tokens and store it in database. I do not want to store duplicate tokens in database, instead i want to store count of each token in the database. Does python has any special datastructure that do not store duplicates but stores the counts of each token?

Björn Pollex · Accepted Answer · 2011-05-17 09:50:39Z

5

Python >=2.7 has the Counter.

edited May 17, 2011 at 9:50

answered May 17, 2011 at 9:42

Björn Pollex

77.1k30 gold badges206 silver badges290 bronze badges

Sign up to request clarification or add additional context in comments.

Comments

MartinStettner · Accepted Answer · 2011-05-17 09:43:25Z

3

I'd suggest to use a simple dictionary to store the count like

storage = {} # initialize
# ...
if !storage.has_key(token):
  storage[token] = 1
else:
  storage[token] += 1

EDIT

That said, if you're using Python 3 I'd follow Space_C0wb0y's suggestion to use the Counter class ...

answered May 17, 2011 at 9:43

MartinStettner

29.3k15 gold badges84 silver badges106 bronze badges

4 Comments

nikhil Over a year ago

if not storage.hash_key(token)

S.Lott Over a year ago

I'd use a collections.defaultdict and eliminate the if statement entirely.

Björn Pollex Over a year ago

@nikhil: Why did you accept this solution? It is quite inefficient. I think the only reason to do it this way is if you have a really old Python version.

Björn Pollex Over a year ago

You should test for a key using if token not in storage.

Martin Thurau · Accepted Answer · 2011-05-17 09:45:42Z

3

The collections package has defaultdict which can be used as a key-value storage with a counter:

>>> s = 'mississippi'
>>> d = defaultdict(int)
>>> for k in s:
...     d[k] += 1
...
>>> d.items()
[('i', 4), ('p', 2), ('s', 4), ('m', 1)]

Just so notice: This is not a databse, it's a pure in memory storage. You would have to save this data somehow!

answered May 17, 2011 at 9:45

Martin Thurau

7,6847 gold badges46 silver badges80 bronze badges

Comments

Jakob Bowyer · Accepted Answer · 2011-05-17 09:57:02Z

0

You could always implement an object for every file, giving it a number of methods, like open and display and etc etc. You could then define __hash__ and __eq__ for the object, this would allow you to store items in a set, causing the duplicates to just update a single instance inside the set.

This is just another way of doing something by no means is it the best method.

answered May 17, 2011 at 9:57

Jakob Bowyer

34.8k8 gold badges80 silver badges92 bronze badges

Collectives™ on Stack Overflow

Python Datastructure Suggestion

4 Answers 4

Comments

4 Comments

Comments

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

4 Answers 4

Comments

4 Comments

Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Related