
I have a script which processes a list of URLs. The script may be called at any time with a fresh list of URLs, and I want to avoid reprocessing a URL that has already been processed at any point in the past.

At this point, all I want to do is match each URL, which can be a really long string, against all previously processed URLs to ensure uniqueness.

My question is: how does an SQL query matching a text URL against a MySQL database containing only URLs (say 40000 long text URLs) compare with my other idea of hashing the URLs and saving the hashes using, say, Python's shelve module?

shelf[hash(url)] = 1

Is shelve usable for a dictionary with 40000 string keys? What about with 40000 numerical keys with binary values? Any gotchas with choosing shelve over MySQL for this simple requirement?

Or, if I use a DB, is there a big benefit to storing URL hashes in my MySQL DB instead of the URL strings themselves?
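
Something like the following sketch is what I have in mind (the file name is arbitrary, and MD5 is just an example of a hash; shelve keys must be strings, so a hex digest is used as the key rather than the built-in hash()):

import hashlib
import shelve

def url_key(url):
    # A stable, fixed-length digest of the URL, usable as a shelve key
    return hashlib.md5(url.encode('utf-8')).hexdigest()

shelf = shelve.open('/tmp/seen_urls.shelve')

def is_new(url):
    """Record the URL and return True if it has not been seen before."""
    key = url_key(url)
    if key in shelf:
        return False
    shelf[key] = 1
    return True

print(is_new('http://www.apple.com/'))  # True on the first run
print(is_new('http://www.apple.com/'))  # False
shelf.close()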

  • I think your idea of saving the hash is very good and can boost performance a lot, but try to choose a hash algorithm with a low collision rate. Commented Apr 3, 2011 at 8:40
  • And the specific question is? Commented Apr 3, 2011 at 8:42

3 Answers

3

Store the URLs in a set, which gives O(1) lookups, and shelve the set itself. With this number of URLs, storing and restoring it will take very little time and memory:

import shelve

# Write URLs to shelve
urls = ['http://www.airmagnet.com/', 'http://www.alcatel-lucent.com/',
        'http://www.ami.com/', 'http://www.apcc.com/', 'http://www.stk.com/',
        'http://www.apani.com/', 'http://www.apple.com/',
        'http://www.arcoide.com/', 'http://www.areca.com.tw/',
        'http://www.argus-systems.com/', 'http://www.ariba.com/',
        'http://www.asus.com.tw/']

s = set(urls)                          # Store URLs as a set - lookup is O(1)
sh = shelve.open('/tmp/shelve.tmp')    # Dump the set (as one unit) to a shelve file
sh['urls'] = s
sh.close()

sh = shelve.open('/tmp/shelve.tmp')    # Retrieve the set from the file
s = sh['urls']
sh.close()
print('http://www.apple.com/' in s)    # True
print('http://matan.name/' in s)       # False

This approach is quite fast:

import datetime
import random
import shelve
import string

# 40,000 random 50-character strings stand in for long URLs
urls = [''.join(random.choice(string.ascii_uppercase + string.digits) for x in range(50))
        for i in range(40000)]
s = set(urls)

start = datetime.datetime.now()
sh = shelve.open('/tmp/test.shelve')
sh['urls'] = s                  # Store the whole set under a single key
sh.close()
end = datetime.datetime.now()
print(end - start)
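
The snippet above only times the write; restoring the set and checking it again can be timed the same way (a minimal follow-up sketch, reusing the /tmp/test.shelve file written above):

import datetime
import shelve

start = datetime.datetime.now()
sh = shelve.open('/tmp/test.shelve')    # Reopen the file written above
restored = sh['urls']                   # Unpickles the whole set in one go
sh.close()
print(len(restored))                    # 40000
print(datetime.datetime.now() - start)  # Restore time; lookups in the set are O(1)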

3 Comments

  • A shelve is not suitable for large amounts of data, especially when you often need to read and write. Using a database, e.g. ZODB, is a much better solution here.
  • 40,000 strings are not big enough to need a database, in my opinion.
  • Looking at this answer and your code, I think shelve should suffice for my purposes.
1

Using a shelve is in general a bad idea for large amounts of data. A database is better suited when you have lots of data.

Options are:

  • ZODB (Python Object Database)
  • any RDBMS (a minimal sketch follows below)
  • the NoSQL world (like MongoDB, which is easily approachable and very fast)
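
For instance, with any RDBMS the uniqueness check reduces to an insert against a unique (primary-key) column. A minimal sketch using Python's built-in sqlite3, which is just one possible choice; MySQL or any other RDBMS would work the same way:

import sqlite3

conn = sqlite3.connect('/tmp/seen_urls.db')
conn.execute('CREATE TABLE IF NOT EXISTS seen (url TEXT PRIMARY KEY)')

def is_new(conn, url):
    """Insert the URL if unseen; return False if it was already stored."""
    try:
        with conn:  # Commits on success, rolls back on error
            conn.execute('INSERT INTO seen (url) VALUES (?)', (url,))
        return True
    except sqlite3.IntegrityError:  # Primary-key violation: already processed
        return False

print(is_new(conn, 'http://www.apple.com/'))  # True on the first run
print(is_new(conn, 'http://www.apple.com/'))  # False
conn.close()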


0

Hashing is a good idea. To search strings, databases use indexes. Since a comparison operation can be defined on strings, it is possible to build an index which is a search tree and process each query with logarithmic complexity.
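
As an illustration of that idea (my own example, with sqlite3 standing in for MySQL): store a fixed-length digest of each URL in an indexed column, and each query becomes a tree search on the index rather than a scan of the stored strings:

import hashlib
import sqlite3

conn = sqlite3.connect('/tmp/seen_hashes.db')
# The PRIMARY KEY gives the digest column a B-tree index, so each lookup
# is a logarithmic-time tree search instead of a full scan.
conn.execute('CREATE TABLE IF NOT EXISTS seen (url_md5 TEXT PRIMARY KEY)')

def seen_before(conn, url):
    """Return True if the URL was processed earlier, recording it otherwise."""
    digest = hashlib.md5(url.encode('utf-8')).hexdigest()  # Fixed-length key
    cur = conn.execute('SELECT 1 FROM seen WHERE url_md5 = ?', (digest,))
    if cur.fetchone():
        return True
    with conn:
        conn.execute('INSERT INTO seen (url_md5) VALUES (?)', (digest,))
    return False

print(seen_before(conn, 'http://www.ariba.com/'))  # False the first time
print(seen_before(conn, 'http://www.ariba.com/'))  # True
conn.close()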

