Python: Check if string and its substring are existing in the same list

Question

I've extracted keywords based on 1-grams, 2-grams, and 3-grams within a tokenized sentence:

list_of_keywords = []
for i in range(0, len(stemmed_words)):
    temp = []
    for j in range(0, len(stemmed_words[i])):
        temp.append([' '.join(x) for x in list(everygrams(stemmed_words[i][j], 1, 3)) if ' '.join(x) in set(New_vocabulary_list)])
    list_of_keywords.append(temp)

This gives me keyword lists such as:

['blood', 'pressure', 'high blood', 'blood pressure', 'high blood pressure']
['sleep', 'anxiety', 'lack of sleep']

How can I simplify the results by removing every string that is a substring of another string in the same list, so that only the following remain:

['high blood pressure']
['anxiety', 'lack of sleep']

Will all substrings be split by a space? What should ['sub', 'string', 'substring'] become?
– Peilonrayz, Commented Mar 15, 2019 at 17:30

Chetan Ameta · Accepted Answer · 2019-03-15 09:55:07Z

13

You could use this one-liner:

b = ['blood', 'pressure', 'high blood', 'blood pressure', 'high blood pressure']
result = [ i for i in b if not any( [ i in a for a in b if a != i]   )]

I admit this is O(n²) in the number of words, so it may be slow for large inputs.

It is essentially the list-comprehension form of the following loop:

word_list =  ['blood', 'pressure', 'high blood', 'blood pressure', 'high blood pressure']

result = []
for this_word in word_list:
    words_without_this_word = [ other_word  for other_word in word_list if other_word != this_word]  
    found = False
    for other_word in words_without_this_word:
        if this_word in other_word:
            found = True

    if not found:
        result.append(this_word)

result
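
To apply the same idea to every keyword list from the question, the check can be wrapped in a small function. This is only a sketch: the name remove_substrings is made up for illustration, and removing duplicates first is an assumption (otherwise two identical strings are both kept, since neither counts as "another" string).

```python
def remove_substrings(words):
    """Keep only the strings that are not contained in another string of the list."""
    # Drop duplicates while keeping the original order (assumption: duplicates
    # should collapse to a single entry).
    unique_words = list(dict.fromkeys(words))
    return [
        word
        for word in unique_words
        # any() stops at the first containing string it finds.
        if not any(word in other for other in unique_words if other != word)
    ]


# The two keyword lists from the question.
keyword_lists = [
    ['blood', 'pressure', 'high blood', 'blood pressure', 'high blood pressure'],
    ['sleep', 'anxiety', 'lack of sleep'],
]

simplified = [remove_substrings(keywords) for keywords in keyword_lists]
print(simplified)
# [['high blood pressure'], ['anxiety', 'lack of sleep']]
```

The cost is still quadratic in the number of strings per list, which should be fine for short per-sentence keyword lists.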

edited Mar 15, 2019 at 9:55

Chetan Ameta

answered Mar 15, 2019 at 9:42

Christian Sloper

1 Comment

beruic Over a year ago

I believe this will be slightly faster if you drop the inner list comprehension and use a generator expression instead, so that any() can stop at the first match: result = [i for i in b if not any(i in a for a in b if a != i)]
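
A related caveat is the one raised in the comment under the question: with a plain `in` test, 'sub' and 'string' are both removed from ['sub', 'string', 'substring'], because they are character-level substrings of 'substring'. If only whole-word n-grams should count, one possible sketch is to pad each phrase with spaces before testing. The function name is hypothetical, and it assumes tokens are joined by single spaces, as in the question's ' '.join(x).

```python
def remove_contained_ngrams(phrases):
    """Drop phrases that occur as a whole-word sequence inside another phrase."""
    unique = list(dict.fromkeys(phrases))  # de-duplicate, keep order
    # Padding with spaces means 'sub' only matches at word boundaries.
    padded = {phrase: f" {phrase} " for phrase in unique}
    return [
        phrase
        for phrase in unique
        if not any(padded[phrase] in padded[other] for other in unique if other != phrase)
    ]


print(remove_contained_ngrams(['sub', 'string', 'substring']))
# ['sub', 'string', 'substring']
print(remove_contained_ngrams(
    ['blood', 'pressure', 'high blood', 'blood pressure', 'high blood pressure']))
# ['high blood pressure']
```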

Eric Duminil · Accepted Answer · 2019-03-15 17:14:49Z

If you have a large list of words, it might be a good idea to use a suffix tree.

Here's a package on PyPI.

Once you've created the tree, you can call find_all(word) to get the index of every occurrence of word. You then simply keep the strings that appear only once:

from suffix_trees import STree
# https://pypi.org/project/suffix-trees/
# pip install suffix-trees

words = ['blood', 'pressure', 'high blood', 'blood pressure', 'high blood pressure'] + ['sleep', 'anxiety', 'lack of sleep']
st = STree.STree(words)

st.find_all('blood')
# [0, 20, 26, 46]

st.find_all('high blood pressure')
# [41]

[word for word in words if len(st.find_all(word)) == 1]
# ['high blood pressure', 'anxiety', 'lack of sleep']

words needs to be a unique list of strings, so you might need to call list(set(words)) before generating the suffix-tree.

As far as I can tell, the whole script should run in O(n), with n being the total length of the strings.
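
For readers without the package, the "keep what occurs exactly once" idea can be imitated with the standard library alone. This sketch is not a suffix tree and is not linear time: it joins the strings with a separator and uses str.count. The function name and the choice of "\n" as separator are assumptions; the separator must not occur inside any word. It also uses dict.fromkeys instead of list(set(words)) so that the original order survives de-duplication.

```python
def keep_unique_occurrences(words, sep="\n"):
    """Keep the words that occur exactly once in the joined text."""
    # De-duplicate while preserving order (set() alone would lose the order).
    unique_words = list(dict.fromkeys(words))
    joined = sep.join(unique_words)
    # A word always matches itself once; any further match must lie inside
    # another word, because the separator cannot be part of a match.
    return [word for word in unique_words if joined.count(word) == 1]


words = ['blood', 'pressure', 'high blood', 'blood pressure', 'high blood pressure'] \
    + ['sleep', 'anxiety', 'lack of sleep']
print(keep_unique_occurrences(words))
# ['high blood pressure', 'anxiety', 'lack of sleep']
```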

Anna Janiszewska · Accepted Answer · 2019-03-15 10:07:35Z

-1

Assuming the elements are ordered from shortest to longest string, you can check whether each element is a substring of the last one and, if so, remove it from the list:

symptoms = ['blood', 'pressure', 'high blood', 'blood pressure', 'high blood pressure']


def removeSubstring(data):
    for symptom in data[:-1]:
        if symptom in data[-1]:
            print("Removing: ", symptom)
            data.remove(symptom)
    print(data)


removeSubstring(symptoms)

answered Mar 15, 2019 at 10:07

Anna Janiszewska

4 Comments

Lisa Over a year ago

Thanks, but the approach you suggested only works when there is a single longest string. I tried it with symptoms = ['blood', 'sleep', 'high blood pressure', 'lack of sleep']

Christian Sloper Over a year ago

It’s normally a really bad idea to remove things from a list while you are iterating over it.

Anna Janiszewska Over a year ago

@ChristianSloper can you elaborate why?

Christian Sloper Over a year ago

quora.com/…
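
To illustrate the general concern in this thread, here is a minimal sketch (both function names are invented for the example) of what can go wrong when a list is modified while it is being iterated: the element right after a removed one gets skipped. Note that removeSubstring above loops over data[:-1], which is a copy, so it sidesteps this particular problem; its limitation is the one Lisa describes.

```python
def remove_while_iterating(items, unwanted):
    """Buggy on purpose: mutates the list it is looping over."""
    items = list(items)  # work on a copy so the caller's list is untouched
    for item in items:
        if item in unwanted:
            items.remove(item)  # shifts later elements left; the next one is skipped
    return items


def filter_with_comprehension(items, unwanted):
    """Safer: build a new list instead of mutating during iteration."""
    return [item for item in items if item not in unwanted]


words = ['blood', 'pressure', 'high blood', 'anxiety']
unwanted = {'blood', 'pressure'}

print(remove_while_iterating(words, unwanted))
# ['pressure', 'high blood', 'anxiety']  <- 'pressure' was skipped
print(filter_with_comprehension(words, unwanted))
# ['high blood', 'anxiety']
```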

Venfah Nazir · Accepted Answer · 2019-03-15 10:09:15Z

-1

words = ['blood', 'pressure', 'high blood', 'blood pressure', 'high blood pressure']

superset_word = ''
#print (words)
for word in words:
    word_list_minus_word = [each for each in words if word != each]
    counter = 0
    for other_word in word_list_minus_word:
        if (other_word not in word):
            break
        else:
            counter += 1
    if (counter == len(word_list_minus_word)):
        superset_word = word
        break
print(superset_word)

answered Mar 15, 2019 at 10:09

Venfah Nazir

Vasilis G. · Accepted Answer · 2019-03-15 18:40:35Z

-3

grams = ['blood', 'pressure', 'high blood', 'blood pressure', 'high blood pressure']

unique_grams = [grams[i] for i in range(len(grams)) if not grams[i] in ' '.join(grams[i+1:])]

edited Mar 15, 2019 at 18:40

Vasilis G.

answered Mar 15, 2019 at 12:21

Jawad Ali Khan

1 Comment

Eric Duminil Over a year ago

It doesn't seem to work. For example with grams = ['a b c', 'b c', 'a', 'b', 'c'].
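
The counterexample fails because the one-liner only compares each string against the joined strings that come after it, so the result depends on the input order. One order-independent alternative, sketched here under the assumption that plain substring containment is what is wanted (keep_maximal is a hypothetical name, not part of the answer), is to visit the strings from longest to shortest and keep one only if no already-kept string contains it.

```python
def keep_maximal(grams):
    """Keep strings not contained in any other string, regardless of input order."""
    kept = []
    # Longest first: a string can only be contained in one at least as long,
    # and containment is transitive, so checking against kept strings suffices.
    for gram in sorted(set(grams), key=len, reverse=True):
        if not any(gram in longer for longer in kept):
            kept.append(gram)
    kept_set = set(kept)
    # Report the survivors in their original order, without duplicates.
    return [gram for gram in dict.fromkeys(grams) if gram in kept_set]


print(keep_maximal(['a b c', 'b c', 'a', 'b', 'c']))
# ['a b c']
print(keep_maximal(['blood', 'pressure', 'high blood', 'blood pressure', 'high blood pressure']))
# ['high blood pressure']
```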
