
I have a script which processes a list of URLs. The script may be called at any time with a fresh list of URLs, and I want to avoid reprocessing a URL that has already been processed at any point in the past.

At this point, all I want to do is match each URL, which can be a really long string, against all previously processed URLs to ensure uniqueness.

My question is: how does an SQL query matching a text URL against a MySQL database containing only URLs (say 40000 long text URLs) compare with my other idea of hashing the URLs and saving the hashes using, say, Python's shelve module?

shelf[hash(url)] = 1

Is shelve usable for a dictionary with 40000 string keys? What about with 40000 numerical keys with binary values? Any gotchas with choosing shelve over MySQL for this simple requirement?

Or, if I use a DB, is there a big benefit to storing URL hashes in my MySQL DB instead of the URL strings themselves?
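
Something like the following sketch is what I have in mind (the file name is arbitrary, and MD5 is just an example of a hash; shelve keys must be strings, so a hex digest is used as the key rather than the built-in hash()):

import hashlib
import shelve

def url_key(url):
    # A stable, fixed-length digest of the URL, usable as a shelve key
    return hashlib.md5(url.encode('utf-8')).hexdigest()

shelf = shelve.open('/tmp/seen_urls.shelve')

def is_new(url):
    """Record the URL and return True if it has not been seen before."""
    key = url_key(url)
    if key in shelf:
        return False
    shelf[key] = 1
    return True

print(is_new('http://www.apple.com/'))  # True on the first run
print(is_new('http://www.apple.com/'))  # False
shelf.close()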

  • I think your idea of saving the hash is very good and can boost performance a lot, but try to choose a hash algorithm with a low collision rate. Commented Apr 3, 2011 at 8:40
  • And the specific question is? Commented Apr 3, 2011 at 8:42

3 Answers

3

Store the URLs in a set, which gives O(1) lookups, and shelve the set itself. With this number of URLs, storing and restoring it will take very little time and memory:

import shelve

# Write URLs to shelve
urls = ['http://www.airmagnet.com/', 'http://www.alcatel-lucent.com/',
        'http://www.ami.com/', 'http://www.apcc.com/', 'http://www.stk.com/',
        'http://www.apani.com/', 'http://www.apple.com/',
        'http://www.arcoide.com/', 'http://www.areca.com.tw/',
        'http://www.argus-systems.com/', 'http://www.ariba.com/',
        'http://www.asus.com.tw/']

s = set(urls)                          # Store URLs as a set - lookup is O(1)
sh = shelve.open('/tmp/shelve.tmp')    # Dump the set (as one unit) to a shelve file
sh['urls'] = s
sh.close()

sh = shelve.open('/tmp/shelve.tmp')    # Retrieve the set from the file
s = sh['urls']
sh.close()
print('http://www.apple.com/' in s)    # True
print('http://matan.name/' in s)       # False

This approach is quite fast:

import datetime
import random
import shelve
import string

# 40,000 random 50-character strings stand in for long URLs
urls = [''.join(random.choice(string.ascii_uppercase + string.digits) for x in range(50))
        for i in range(40000)]
s = set(urls)

start = datetime.datetime.now()
sh = shelve.open('/tmp/test.shelve')
sh['urls'] = s                  # Store the whole set under a single key
sh.close()
end = datetime.datetime.now()
print(end - start)
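
The snippet above only times the write; restoring the set and checking it again can be timed the same way (a minimal follow-up sketch, reusing the /tmp/test.shelve file written above):

import datetime
import shelve

start = datetime.datetime.now()
sh = shelve.open('/tmp/test.shelve')    # Reopen the file written above
restored = sh['urls']                   # Unpickles the whole set in one go
sh.close()
print(len(restored))                    # 40000
print(datetime.datetime.now() - start)  # Restore time; lookups in the set are O(1)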

3 Comments

  • A shelve is not suitable for large amounts of data, especially when you often need to read and write. Using a database, e.g. ZODB, is a much better solution here.
  • 40,000 strings are not big enough to need a database, in my opinion.
  • Looking at this answer and your code, I think shelve should suffice for my purposes.
1

Using a shelve is in general a bad idea for large amounts of data. A database is better suited when you have lots of data.

Options are:

  • ZODB (Python Object Database)
  • any RDBMS (a minimal sketch follows below)
  • the NoSQL world (like MongoDB, which is easily approachable and very fast)
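
For instance, with any RDBMS the uniqueness check reduces to an insert against a unique (primary-key) column. A minimal sketch using Python's built-in sqlite3, which is just one possible choice; MySQL or any other RDBMS would work the same way:

import sqlite3

conn = sqlite3.connect('/tmp/seen_urls.db')
conn.execute('CREATE TABLE IF NOT EXISTS seen (url TEXT PRIMARY KEY)')

def is_new(conn, url):
    """Insert the URL if unseen; return False if it was already stored."""
    try:
        with conn:  # Commits on success, rolls back on error
            conn.execute('INSERT INTO seen (url) VALUES (?)', (url,))
        return True
    except sqlite3.IntegrityError:  # Primary-key violation: already processed
        return False

print(is_new(conn, 'http://www.apple.com/'))  # True on the first run
print(is_new(conn, 'http://www.apple.com/'))  # False
conn.close()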


0

Hashing is a good idea. To search strings, databases use indexes. Since a comparison operation can be defined on strings, it is possible to build an index which is a search tree and process each query with logarithmic complexity.
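
As an illustration of that idea (my own example, with sqlite3 standing in for MySQL): store a fixed-length digest of each URL in an indexed column, and each query becomes a tree search on the index rather than a scan of the stored strings:

import hashlib
import sqlite3

conn = sqlite3.connect('/tmp/seen_hashes.db')
# The PRIMARY KEY gives the digest column a B-tree index, so each lookup
# is a logarithmic-time tree search instead of a full scan.
conn.execute('CREATE TABLE IF NOT EXISTS seen (url_md5 TEXT PRIMARY KEY)')

def seen_before(conn, url):
    """Return True if the URL was processed earlier, recording it otherwise."""
    digest = hashlib.md5(url.encode('utf-8')).hexdigest()  # Fixed-length key
    cur = conn.execute('SELECT 1 FROM seen WHERE url_md5 = ?', (digest,))
    if cur.fetchone():
        return True
    with conn:
        conn.execute('INSERT INTO seen (url_md5) VALUES (?)', (digest,))
    return False

print(seen_before(conn, 'http://www.ariba.com/'))  # False the first time
print(seen_before(conn, 'http://www.ariba.com/'))  # True
conn.close()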

