
Hi, I am working with Python 3. I've been facing this issue for a while now and I can't seem to figure it out.

I have 2 numpy arrays containing strings

array_one = np.array(['alice', 'in', 'a', 'wonder', 'land', 'alice in', 'in a', 'a wonder', 'wonder land', 'alice in a', 'in a wonder', 'a wonder land', 'alice in a wonder', 'in a wonder land', 'alice in a wonder land'])

If you notice, array_one is actually an array containing the 1-grams, 2-grams, 3-grams, 4-grams, and 5-grams of the sentence alice in a wonder land.

I have purposefully taken wonderland as two words, wonder and land.

Now I have another numpy array that contains some locations and names.

array_two = np.array(['new york', 'las vegas', 'wonderland', 'florida'])

Now what I want to do is get all the elements of array_one that exist in array_two.

If I take the intersection of the two arrays using np.intersect1d, I don't get any matches, since wonderland is two separate words in array_one while in array_two it is a single word.
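
For illustration, a quick check of why the plain intersection comes back empty (array_one abbreviated here):

import numpy as np

array_one = np.array(['alice', 'in', 'a', 'wonder', 'land', 'wonder land'])
array_two = np.array(['new york', 'las vegas', 'wonderland', 'florida'])

# no element of array_one is exactly equal to any element of array_two
print(np.intersect1d(array_one, array_two))  # -> empty array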

Is there any way to do this? I've tried solutions from Stack Overflow (this) but they don't seem to work with Python 3.

array_one would have at most 60-100 items, while array_two would have at most roughly 1 million items, but on average 250,000-500,000 items.


Edit

I've used a very naive approach since I wasn't able to find a solution until now: I removed the whitespace from the elements of both arrays and then used the resulting boolean array (e.g. [True, False, True]) to filter the original array. Below is the code:

import numpy.core.defchararray as np_f
import numpy as np

# strip the whitespace out of every element of both arrays
array_two_wr = np_f.replace(array_two, ' ', '')
array_one_wr = np_f.replace(array_one, ' ', '')
# keep the original array_two elements whose whitespace-stripped form
# also appears in the whitespace-stripped array_one
intersections = array_two[np.in1d(array_two_wr, array_one_wr)]

But I am not sure this is the way to go, considering the number of elements in array_two.

  • Can you try to use the Levenshtein distance? en.wikipedia.org/wiki/Levenshtein_distance Commented Feb 21, 2018 at 18:49
  • @EspoirMurhabazi I thought of Levenshtein distance and cosine string matching, but the problem is how to implement them without using two for loops; that's the first issue. Second, I need something that handles whitespace, since a Levenshtein distance of 1 would match block A and block B as the same, while cosine would match them at 0.90 (see the sketch after these comments). Commented Feb 21, 2018 at 18:53
  • Maybe you can use locality-sensitive hashing as discussed in this SO question Commented Feb 21, 2018 at 19:33
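
A minimal sketch of the whitespace issue raised in the comments above, using a plain dynamic-programming edit distance (just for illustration; a library such as python-Levenshtein would report the same distances):

def edit_distance(a, b):
  '''Plain dynamic-programming Levenshtein distance'''
  prev = list(range(len(b) + 1))
  for i, ca in enumerate(a, 1):
    curr = [i]
    for j, cb in enumerate(b, 1):
      curr.append(min(prev[j] + 1,                 # deletion
                      curr[j - 1] + 1,             # insertion
                      prev[j - 1] + (ca != cb)))   # substitution
    prev = curr
  return prev[-1]

# a distance of 1 can mean "different word" or just "a missing space"
print(edit_distance('block A', 'block B'))         # 1
print(edit_distance('wonder land', 'wonderland'))  # 1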

2 Answers


Minhashing could definitely be used here. Here's the very general idea behind minhashing: for each object in a list, hash the object multiple times, and update an object that keeps track of the hashes computed for each list member. Then examine the set of resulting hashes, and for each, find all objects for which that hash was computed (we just stored this data). The objects for which the same hash was computed will be very similar if the hashing function is chosen carefully.
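
A minimal, library-free sketch of the signature idea (the datasketch implementation below does this, plus the LSH bucketing, far more efficiently; the md5-based hash functions here are just illustrative):

import hashlib

def shingles(s, n=3):
  '''The set of character n-grams of a string'''
  return {s[i:i + n] for i in range(len(s) - n + 1)}

def minhash_signature(s, num_hashes=64):
  '''One minimum hash per seeded hash function, taken over the string's shingles'''
  return [min(int(hashlib.md5((str(seed) + sh).encode('utf8')).hexdigest(), 16)
              for sh in shingles(s))
          for seed in range(num_hashes)]

def estimated_similarity(a, b):
  '''Fraction of signature positions that agree -- an estimate of Jaccard similarity'''
  sig_a, sig_b = minhash_signature(a), minhash_signature(b)
  return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

print(estimated_similarity('wonder land', 'wonderland'))  # clearly above zero
print(estimated_similarity('wonder land', 'las vegas'))   # 0.0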

For a more detailed explanation of minhashing, see Chapter 3 of Mining Massive Datasets.

Here's a sample Python 3 implementation using your data and datasketch (pip install datasketch), which computes the hashes:

import numpy as np
from datasketch import MinHash, MinHashLSH
from nltk import ngrams

def build_minhash(s):
  '''Given a string `s` build and return a minhash for that string'''
  new_minhash = MinHash(num_perm=256)
  # hash each 3-character gram in `s`
  for chargram in ngrams(s, 3):
    new_minhash.update(''.join(chargram).encode('utf8'))
  return new_minhash

array_one = np.array(['alice', 'in', 'a', 'wonder', 'land', 'alice in', 'in a', 'a wonder', 'wonder land', 'alice in a', 'in a wonder', 'a wonder land', 'alice in a wonder', 'in a wonder land', 'alice in a wonder land'])
array_two = np.array(['new york', 'las vegas', 'wonderland', 'florida'])

# create a structure that lets us query for similar minhashes
lsh = MinHashLSH(threshold=0.3, num_perm=256)

# loop over the index and value of each member in array two
for idx, i in enumerate(array_two):
  # add the minhash to the lsh index
  lsh.insert(idx, build_minhash(i))

# find the items in array_one with 1+ matches in arr_two
for i in array_one:
  result = lsh.query(build_minhash(i))
  if result:
    matches = ', '.join([array_two[j] for j in result])
    print(' *', i, '--', matches)

Results (array_one member on left, array_two match on right):

 * wonder -- wonderland
 * a wonder -- wonderland
 * wonder land -- wonderland
 * a wonder land -- wonderland
 * in a wonder land -- wonderland
 * alice in a wonder land -- wonderland

The easiest way to tune for precision/recall here is by changing the threshold argument to MinHashLSH. You can also try modifying the hashing technique itself. Here I hashed 3-character grams when building each minhash, a technique Yale's Digital Humanities lab found tremendously powerful at capturing text similarity: https://github.com/YaleDHLab/intertext
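
For instance, reusing build_minhash and array_two from the snippet above (the two threshold values here are just examples):

# a lower threshold returns looser candidate matches, a higher one
# demands more signature agreement before reporting a match
loose_lsh  = MinHashLSH(threshold=0.2, num_perm=256)
strict_lsh = MinHashLSH(threshold=0.6, num_perm=256)

for idx, i in enumerate(array_two):
  minhash = build_minhash(i)
  loose_lsh.insert(idx, minhash)
  strict_lsh.insert(idx, minhash)

query = build_minhash('wonder land')
print(loose_lsh.query(query))   # more candidate indices
print(strict_lsh.query(query))  # fewer candidates, possibly none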


7 Comments

I am having some trouble with this code (yes, given the problem statement this seems the more viable solution). The issue is that array_one can change based on input, but array_two remains constant for all cases. I am having trouble creating that combined numpy.array, since I was thinking I'd process array_two into an LSH object and store it as a pickle.
@iam.Carrot I just updated the above so array_two is static and array_one can be dynamic. When array_one updates, just build the minhash for the new element(s), then query the LSH index, which now only contains array_two elements. Does that make sense?
ahh! I almost had the same code (tiny bit of difference). I'll quickly test it out.
It works just fine. Thank you for that snippet; I had put array_one instead of array_two by mistake in join([array_two[j] for j in result]). I have two final questions: 1. How much can the MinHash hold? Can it hold 1 million or even 10 million, say? 2. If I have to apply Cosine, Levenshtein, or even Jaro-Winkler, does LSH have a provision for them, or do I post-process?
perfect. Thank you so much for taking the time to help me out. I really appreciate it.

Sorry to post two answers, but after adding the locality-sensitive-hashing technique above, I realized you could exploit the class separation in your data (query vectors and potential matching vectors) by using a bloom filter.

A bloom filter is a beautiful object that lets you pass in some objects, then query to see whether a given object has been added to the bloom filter. Here's an awesome visual demo of a bloom filter.
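
To make the idea concrete, here is a toy from-scratch version (the bloom-filter package used below is what you would actually reach for):

import hashlib

class TinyBloom:
  '''A toy bloom filter: a fixed bit array plus a few seeded hash functions'''
  def __init__(self, size=1024, num_hashes=3):
    self.size = size
    self.num_hashes = num_hashes
    self.bits = [False] * size

  def _positions(self, item):
    '''The bit positions that represent `item`'''
    for seed in range(self.num_hashes):
      digest = hashlib.md5((str(seed) + item).encode('utf8')).hexdigest()
      yield int(digest, 16) % self.size

  def add(self, item):
    for pos in self._positions(item):
      self.bits[pos] = True

  def __contains__(self, item):
    # can return a false positive, but never a false negative
    return all(self.bits[pos] for pos in self._positions(item))

bloom = TinyBloom()
bloom.add('wonderland')
print('wonderland' in bloom)  # True
print('florida' in bloom)     # almost certainly False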

In your case we can add each member of array_two to the bloom filter, then query each member of array_one to see whether it's in the bloom filter. Using pip install bloom-filter:

from bloom_filter import BloomFilter # pip install bloom-filter
import numpy as np
import re

def clean(s):
  '''Clean a string'''
  return re.sub(r'\s+', '', s)

array_one = np.array(['alice', 'in', 'a', 'wonder', 'land', 'alice in', 'in a', 'a wonder', 'wonder land', 'alice in a', 'in a wonder', 'a wonder land', 'alice in a wonder', 'in a wonder land', 'alice in a wonder land'])
array_two = np.array(['new york', 'las vegas', 'wonderland', 'florida'])

# initialize bloom filter with particular size
bloom = BloomFilter(max_elements=10000, error_rate=0.1)
# add each member of array_two to bloom filter
[bloom.add(clean(i)) for i in array_two]
# find the members in array_one in array_two
matches = [i for i in array_one if clean(i) in bloom]
print(matches)

Result: ['wonder land']

Depending on your requirements, this could be a very efficient (and highly-scalable) solution.
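
If array_two really holds hundreds of thousands to a million entries, one way to size the filter (reusing clean, array_one, and array_two from above) would be:

# size the filter to the real array_two; a smaller error_rate costs
# more memory but yields fewer false-positive matches
bloom = BloomFilter(max_elements=len(array_two), error_rate=0.001)
for i in array_two:
  bloom.add(clean(i))
matches = [i for i in array_one if clean(i) in bloom]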

5 Comments

You've set max_elements=10000; is there a significance to it? Can I set it to 1 million?
Yes, you should set the max_elements param to the number of elements you intend to pass in (you may need decent hardware to do it all in main memory). The bloom filter is simpler but much less intelligent than LSH -- looking at your data will help you decide what's best...
Great. I have one last query: how does the BloomFilter handle duplicates? (I am sorry if my comments are a little naïve, since I just got introduced to these methods.) Also, why doesn't it match anything when the string is short? For example, it doesn't match a c c road to acc road, and when I put in acc road it still doesn't match; but when I run it a few times, it starts to match all the mentioned cases. Is there something I am missing?
@iam.Carrot check out that visual demo I linked above. Your acc road example works as expected on my machine (ie match is found). Duplicate entries should have no effect on a bloom filter. However, a bloom filter won't handle substring matching -- that's what more abstract methods like LSH are for. Bloom filters are fast if you only need to find raw set intersections, but if you need something fuzzier LSH techniques are more flexible. I hope that helps!
No, yeah, it does match, but I had to run the code a few times for it to match. I thought I was missing something, maybe something trivial. Nevertheless, you've been of great help. Thanks much.
