0

I have a large array containing URL (it can contains 100 000 URL strings), and I would like to know if my actual URL is one of the URL from the array. For that, I have to compare the actual URL string with all the URL string in the array. Is there any way to compare with this large array but with less time than I do now ? For now it's :

error = 0
for oldUrl in urlList:
    error = 1 if oldUrl == actualUrl else error
0

3 Answers 3

1

As already mentioned by @Laurent and @sisanared, you can use the in operator for either lists or sets to check for membership. For example:

found = x in some_list
if found: 
    #do stuff
else:
    #other stuff

However, you mentioned that speed is an issue. TL;DR -- sets are faster if the set already exists. From https://wiki.python.org/moin/TimeComplexity, checking membership using the in operator is O(n) for list and O(1) for set (like @enderland pointed out).

For 100,000 items, or for one-time-only checks it probably doesn't make much of a difference which you use, but for a larger number of items or situations where you'll be doing many checks, you should probably use a set. I did a couple of tests from the interpreter and this is what I found (Python 2.7, i3 Windows 10 64bit):

import timeit
#Case 1: Timing includes building the list/set
def build_and_check_a_list(n):
    a_list = [ '/'.join( ('http:stackoverflow.com',str(i)) ) for i in xrange(1,n+1) ]
    check = '/'.join( ('http:stackoverflow.com',str(n)) )
    found = check in a_list
    return (a_list, found)

def build_and_check_a_set(n):
    a_set = set( [ '/'.join( ('http:stackoverflow.com',str(i)) ) for i in xrange(1,n+1) ] )
    check = '/'.join( ('http:stackoverflow.com',str(n)) )
    found = check in a_set
    return (a_set, found)

timeit.timeit('a_list, found = build_and_check_a_list(100000)', 'from __main__ import build_and_check_a_list', number=50)
3.211972302022332

timeit.timeit('a_set, found = build_and_check_a_set(100000)', 'from __main__ import build_and_check_a_set', number=50)
4.5497120006930345

#Case 2: The list/set already exists (timing excludes list/set creation)
check = '/'.join( ('http:stackoverflow.com',str(100000)) )

timeit.timeit('found = check in a_list', 'from __main__ import a_list, check', number=50)
0.12173540635194513

timeit.timeit('found = check in a_set', 'from __main__ import a_set, check', number=50)
1.01052391983103e-05

For 1 million entries, to build and/or check membership on my computer:

#Case 1: list/set creation included
timeit.timeit('a_list, found = build_and_check_a_list(1000000)', 'from __main__ import build_and_check_a_list', number=50)
35.71641090788398

timeit.timeit('a_set, found = build_and_check_a_set(1000000)', 'from __main__ import build_and_check_a_set', number=50)
51.41244436103625

#Case 2: list/set already exists
check = '/'.join( ('http:stackoverflow.com',str(1000000)) )

timeit.timeit('found = check in a_list', 'from __main__ import a_list, check', number=50)
1.3113457772124093

timeit.timeit('found = check in a_set', 'from __main__ import a_set, check', number=50)
8.180430086213164e-06
Sign up to request clarification or add additional context in comments.

Comments

1

To check if a list contains an item, use: item in list.

So, you can write:

error = oldUrl in urlList

Comments

0

Don't use a list for this. Lookups in lists have a worst case complexity of O(n).

Use a set (or dictionary if you have other metadata) instead. This has a lookup of roughly O(1). See here for comparisons between a set, dictionary, and list.

Using a set, the lookup is simple:

urls = set(['url1', 'url2', 'url3'])

print ('url2' in urls)
print ('foobar' in urls)

Or in your case, convert your list object as a set:

urlListSet = set(urlList)
print(oldUrl in urlListSet)

You can also add new urls to your set:

urlListSet.add(newurl)
urlListSet.update(listOfNewUrls)

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.