12

I've extracted keywords by matching the 1-grams, 2-grams, and 3-grams of each tokenized sentence against a vocabulary list:

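from nltk.util import everygrams

vocabulary = set(New_vocabulary_list)  # build the lookup set once, not once per n-gram
list_of_keywords = []
for group in stemmed_words:
    temp = []
    for tokens in group:
        # join each 1- to 3-gram into a phrase and keep it if it is in the vocabulary
        grams = (' '.join(gram) for gram in everygrams(tokens, 1, 3))
        temp.append([gram for gram in grams if gram in vocabulary])
    list_of_keywords.append(temp)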

This gives keyword lists such as:

['blood', 'pressure', 'high blood', 'blood pressure', 'high blood pressure']
['sleep', 'anxiety', 'lack of sleep']

How can I simplify the results by removing every keyword that is a substring of another keyword in the same list, so that only the following remain?

['high blood pressure']
['anxiety', 'lack of sleep']
1
  • Will all substrings be split by a space? What should ['sub', 'string', 'substring'] become? Commented Mar 15, 2019 at 17:30

5 Answers

13

You could use this one-liner:

b = ['blood', 'pressure', 'high blood', 'blood pressure', 'high blood pressure']
result = [i for i in b if not any([i in a for a in b if a != i])]

I admit this is O(n²) in the number of keywords and may be slow for large inputs.

It is basically the list-comprehension form of the following loop:

word_list =  ['blood', 'pressure', 'high blood', 'blood pressure', 'high blood pressure']

result = []
for this_word in word_list:
    words_without_this_word = [other_word for other_word in word_list if other_word != this_word]
    found = False
    for other_word in words_without_this_word:
        if this_word in other_word:
            found = True
            break  # one containing string is enough

    if not found:
        result.append(this_word)

print(result)
# ['high blood pressure']
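The question actually holds several keyword lists, so the same filter has to run once per list. A minimal sketch of that, assuming the lists are plain Python lists of strings; the names drop_substrings and keyword_lists are made up for this illustration, and the generator expression inside any() lets it stop at the first match:

```python
def drop_substrings(keywords):
    """Keep only the keywords that are not contained in a different keyword."""
    return [
        kw for kw in keywords
        if not any(kw in other for other in keywords if other != kw)
    ]


# Hypothetical input mirroring the two lists shown in the question.
keyword_lists = [
    ['blood', 'pressure', 'high blood', 'blood pressure', 'high blood pressure'],
    ['sleep', 'anxiety', 'lack of sleep'],
]

simplified = [drop_substrings(keywords) for keywords in keyword_lists]
print(simplified)
# [['high blood pressure'], ['anxiety', 'lack of sleep']]
```

Note that this is a plain character-level substring test, so 'sub' would also be dropped from ['sub', 'substring'].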
1 Comment

I believe this will be slightly faster if you drop the inner list comprehension so that it becomes a generator expression, which lets any() stop at the first match: result = [i for i in b if not any(i in a for a in b if a != i)]
1

If you have a large list of words, it might be a good idea to use a suffix tree.

Here's a package on PyPI.

Once you've created the tree, you can call find_all(word) to get the index of every occurrence of word. You simply need to keep the strings that appear only once:

from suffix_trees import STree
# https://pypi.org/project/suffix-trees/
# pip install suffix-trees

words = ['blood', 'pressure', 'high blood', 'blood pressure', 'high blood pressure'] + ['sleep', 'anxiety', 'lack of sleep']
st = STree.STree(words)

st.find_all('blood')
# [0, 20, 26, 46]

st.find_all('high blood pressure')
# [41]

[word for word in words if len(st.find_all(word)) == 1]
# ['high blood pressure', 'anxiety', 'lack of sleep']

The words list must not contain duplicates, so you might need to call list(set(words)) before generating the suffix tree (note that this does not preserve the original order).

As far as I can tell, the whole script should run in O(n), with n being the total length of the strings.
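If installing a package is not an option, a different technique can also avoid the pairwise comparison. This is not a suffix tree: it is a rough sketch that assumes every keyword is a space-separated n-gram of only a few tokens (at most three in the question), so listing all shorter n-grams inside each keyword is cheap. The function name remove_sub_ngrams is made up for this illustration. Because it compares whole tokens, it also leaves ['sub', 'string', 'substring'] untouched, which may or may not be what you want.

```python
def remove_sub_ngrams(keywords):
    """Drop keywords whose tokens form a contiguous run inside a longer keyword."""
    tokenized = [tuple(keyword.split()) for keyword in keywords]

    # Collect every proper contiguous sub-n-gram of every keyword.
    covered = set()
    for tokens in tokenized:
        n = len(tokens)
        for start in range(n):
            for end in range(start + 1, n + 1):
                if end - start < n:  # skip the keyword itself
                    covered.add(tokens[start:end])

    # A keyword survives only if no longer keyword contains it token for token.
    return [kw for kw, tokens in zip(keywords, tokenized) if tokens not in covered]


print(remove_sub_ngrams(['blood', 'pressure', 'high blood', 'blood pressure', 'high blood pressure']))
# ['high blood pressure']
print(remove_sub_ngrams(['sleep', 'anxiety', 'lack of sleep']))
# ['anxiety', 'lack of sleep']
print(remove_sub_ngrams(['sub', 'string', 'substring']))
# ['sub', 'string', 'substring']
```

With the n-gram length capped, each keyword contributes a small constant number of set entries, so the work grows roughly linearly with the number of keywords.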

-1

Assuming your elements are ordered from shortest to longest string, you need to check whether each element is a substring of the last one and, if so, remove it from the list:

symptoms = ['blood', 'pressure', 'high blood', 'blood pressure', 'high blood pressure']


def removeSubstring(data):
    for symptom in data[:-1]:
        if symptom in data[-1]:
            print("Removing: ", symptom)
            data.remove(symptom)
    print(data)


removeSubstring(symptoms)

4 Comments

Thanks, but the approach you suggested only works when there is a single longest string. I tried it with symptoms = ['blood', 'sleep', 'high blood pressure', 'lack of sleep'].
It’s normally a really bad idea to remove things from a list while you are iterating over it.
@ChristianSloper can you elaborate why?
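To illustrate the concern: when a list shrinks while a for loop is walking it, the loop's internal index keeps advancing, so the element that slides into the freed slot is never visited. The sketch below uses made-up helper names (remove_in_place, filter_copy) to show the skipped element and a safer filter. The answer's own loop walks the slice data[:-1], which is a copy, so it sidesteps this particular pitfall.

```python
target = 'high blood pressure'


def remove_in_place(items):
    # Buggy on purpose: mutates the very list the for loop is walking.
    for item in items:
        if item != target and item in target:
            items.remove(item)
    return items


def filter_copy(items):
    # Safer: build a new list instead of mutating the one being iterated.
    return [item for item in items if item == target or item not in target]


print(remove_in_place(['blood', 'pressure', 'high blood pressure']))
# ['pressure', 'high blood pressure']  ('pressure' was skipped, never tested)
print(filter_copy(['blood', 'pressure', 'high blood pressure']))
# ['high blood pressure']
```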
-1

words = ['blood', 'pressure', 'high blood', 'blood pressure', 'high blood pressure']

superset_word = ''
for word in words:
    word_list_minus_word = [each for each in words if word != each]
    # accept `word` only if every other entry is a substring of it
    if all(other_word in word for other_word in word_list_minus_word):
        superset_word = word
        break
print(superset_word)

-3

grams = ['blood', 'pressure', 'high blood', 'blood pressure', 'high blood pressure']

unique_grams = [gram for i, gram in enumerate(grams) if gram not in ' '.join(grams[i+1:])]

1 Comment

It doesn't seem to work, for example with grams = ['a b c', 'b c', 'a', 'b', 'c'].
