1

I have two lists of words to compare. First list is 2 million words, second list is 150,000 words. What I need to do is to apply binary search to see if words of the first list appear in the second. I was trying liner search:

for word in words_list:
    if word in dict_list:
       print(word, 1)
    else:
       print(word, 0)

It works good, but it is very slow. Then I tried binary search but it did not work correctly:

for word in wordlist:
    lb = 0
    ub = len(dict_list)
    mid_index = (lb + ub) // 2
    item_at_mid = dict_list[mid_index]
    if item_at_mid == word:
        print(word)
    if item_at_mid < word:
        lb = mid_index + 1
    else:
        ub = mid_index

In the end I need two list first list of words that are in dictionary and second that are not.

3
  • linux command will do that for you cmp file1 file2 Commented Apr 5, 2016 at 10:00
  • Are you actually printing the values, or is this for demonstation? Printing is super slow compared to putting the elements into two new lists. Commented Apr 5, 2016 at 10:03
  • it is a demostrationa Commented Apr 5, 2016 at 10:04

4 Answers 4

2

You can use sets:

inter = set(words_list).intersection(dict_list)
for word in words_list:
    if word in inter:
        print(word, 1)
    else:
        print(word, 0)
Sign up to request clarification or add additional context in comments.

2 Comments

I do not have dictionary I have 2 lists.
OK, so just use dict_list.keys() -> dict_list. I thought it's a dict because of the name. edited
1

If you want to do a Binary Search:

present = []
absent = []
for word in firstList:
    lb,ub = 0,len(secondList) - 1
    found = False
    while lb <= ub:
        mid = (lb + ub) // 2
        if secondList[mid] == word:
            found = True
            break
        elif secondList[mid] < word:
            lb = mid + 1
        else:
            ub = mid - 1

    if found:
        present.append(word)
    else:
        absent.append(word)

Your binary search code was incorrect.

Comments

1

If you are using binary search your input should have been ordered. One other possibility is convert your words_list and dict_list to set and get the output as follows:

words common to both:

words_list.intersection(dict_list)

words not in one other:

words_list-dict_list
dict_list-words_list

2 Comments

From the wording it is implied he can get away with only making a set out of words_list. Then using intersection() and difference() and passing dict_list as a list will achieve the same result.
@Reti43: I was thinking same. Later i looked at the comments of previous answer which confirms only the name contains the word dict but actually it is list.
0

A solution is to prefer using set instead of list, because of the O(1) complexity for __contains__ operation, as given in this solution.

If memory is a problem, then using a bloom filter is probably a good tradeoff (no false negative).

Here is a python implementation.


To create and maintain a binary tree, consider using heapq standard module.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.