7

Is it possible to write this example as a list comprehension?

a = ['test', 'smth']
b = ['test Lorem ipsum dolor sit amet',
     'consectetur adipiscing elit',
     'test Nulla lectus ligula',
     'imperdiet at porttitor quis',
     'smth commodo eget tortor', 
     'Orci varius natoque penatibus et magnis dis parturient montes']


for s in a:
    b = [el.replace(s,'') for el in b]

What I want is to delete specific words from a list of sentences. I can do it with a loop, but I suppose there is a one-line solution.

I tried something like:

b = [[el.replace(s,'') for el in b] for s in a ]

but it goes wrong
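For reference, here is a small sketch of what that nested comprehension builds. The outer loop runs over a, so the result has one inner list per word, and each inner list starts again from the original b with only that one word removed. The names nested and looped, and the shortened b, are only for illustration.

```python
a = ['test', 'smth']
b = ['test Lorem ipsum dolor sit amet',
     'smth commodo eget tortor']

# Nested comprehension: one inner list per word in a,
# each one built from the original b.
nested = [[el.replace(s, '') for el in b] for s in a]

# Loop version: every pass works on the result of the previous pass.
looped = b
for s in a:
    looped = [el.replace(s, '') for el in looped]

print(len(nested))  # 2 (one list per word in a)
print(nested[0])    # only 'test' has been removed
print(looped)       # both words have been removed
```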


I got a lot of quality answers, but now I have one more complication: what if I want to use a combination of words?

a = ['test', 'smth commodo']

Thank you for all the answers! I ran a speed test for all the solutions, and here is the result. Each figure is the mean of 100 runs (except the last one, which takes too long to wait for).

                      b=10 a=2   |  b=9000 a=2 | b=9000 a=100 | b=45k a=500
---------------------------------+-------------+--------------+---------------
COLDSPEED solution:   0.0000206  |  0.0311071  |  0.0943433   |  4.5012770
Jean Fabre solution:  0.0000871  |  0.1722340  |  0.2635452   |  5.2981001
Jpp solution:         0.0000212  |  0.0474531  |  0.0464369   |  0.2450547
Ajax solution:        0.0000334  |  0.0303891  |  0.5262040   | 11.6994496
Daniel solution:      0.0000167  |  0.0162156  |  0.1301132   |  6.9071504
Kasramvd solution:    0.0000120  |  0.0084146  |  0.1704623   |  7.5648351

We can see that Jpp's solution is the fastest on the larger inputs, BUT we can't use it: it's the only one of these solutions that can't handle combinations of words (I have already written to him and hope he will improve his answer!). So it looks like @cᴏʟᴅsᴘᴇᴇᴅ's solution is the fastest on big data sets.
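The post does not say how the timings were collected, so the following is only a sketch of one way to get a "mean of 100 runs" figure with the standard timeit module. The function name loop_replace and the data size are assumptions made for illustration.

```python
import timeit

a = ['test', 'smth']
b = ['test Lorem ipsum dolor sit amet',
     'smth commodo eget tortor'] * 5   # hypothetical data size


def loop_replace(words, sentences):
    """The plain loop from the question, wrapped in a function."""
    for s in words:
        sentences = [el.replace(s, '') for el in sentences]
    return sentences


# timeit returns the total time for `number` calls;
# dividing by the number of runs gives the mean per call.
runs = 100
mean_seconds = timeit.timeit(lambda: loop_replace(a, b), number=runs) / runs
print('loop: {:.7f}'.format(mean_seconds))
```

Each other solution can be wrapped in a function the same way and timed on the same a and b.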

2
  • 1
    And why would you want to rewrite perfectly good code into a one-liner that's more difficult to read? Commented Apr 23, 2018 at 8:31
  • 1
    @Aran-Fey I'm not sure :) I thought it more pythonic and maybe a little faster. Commented Apr 23, 2018 at 8:33

7 Answers

5

There's nothing wrong with what you have, but if you want to clean things up a bit and performance isn't important, then compile a regex pattern and call sub inside a loop.

>>> import re
>>> p = re.compile(r'\b({})\b'.format('|'.join(a)))
>>> [p.sub('', text).strip() for text in b]

['Lorem ipsum dolor sit amet',
 'consectetur adipiscing elit',
 'Nulla lectus ligula',
 'imperdiet at porttitor quis',
 'commodo eget tortor',
 'Orci varius natoque penatibus et magnis dis parturient montes'
]

Details
Your pattern will look something like this:

\b    # word-boundary - remove if you also want to replace substrings
(
test  # word 1
|     # regex OR pipe
smth  # word 2 ... you get the picture
)
\b    # end with another word boundary - again, remove for substr replacement

And this is the compiled regex pattern matcher:

>>> p
re.compile(r'\b(test|smth)\b', re.UNICODE)
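To see what the \b anchors change, here is a small sketch. The sample text is made up for illustration; 'contest' contains 'test' as a substring.

```python
import re

words = ['test', 'smth']
text = 'test the contest smth'

# With word boundaries only whole words are removed.
with_boundary = re.compile(r'\b({})\b'.format('|'.join(words)))
# Without them, any occurrence of the substring is removed.
without_boundary = re.compile('({})'.format('|'.join(words)))

print(repr(with_boundary.sub('', text)))     # ' the contest '
print(repr(without_boundary.sub('', text)))  # ' the con '
```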

Another consideration is whether the words you want to remove contain characters that the regex engine would interpret specially rather than as literals. These are regex metacharacters, and you can escape them with re.escape while building your pattern.

p = re.compile(r'\b({})\b'.format(
    '|'.join([re.escape(word) for word in a]))
)
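A quick sketch of why the escaping matters. The word 'a.c' is a made-up example; unescaped, its dot matches any character, so 'abc' is removed as well.

```python
import re

words = ['a.c', 'smth']
text = 'abc a.c smth'

unescaped = re.compile(r'\b({})\b'.format('|'.join(words)))
escaped = re.compile(r'\b({})\b'.format(
    '|'.join(re.escape(w) for w in words)))

print(repr(unescaped.sub('', text)))  # '  '     ('abc' is removed too)
print(repr(escaped.sub('', text)))    # 'abc  '  (only the literal 'a.c')
```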

Of course, keep in mind that with larger data and more replacements, both regex and plain string replacement become slow. Consider something better suited to large-scale replacement, like flashtext.


7 Comments

You should probably change that join(a) to join(re.escape(word) for word in a) to be on the safe side in case the input contains any special regex characters
@Aran-Fey I thought about it... decided it would only serve to confuse. But you're right, will amend in the forthcoming iteration.
@cᴏʟᴅsᴘᴇᴇᴅ Thank you again for so fast answer! As I understood, using re is slower than loop. Am I right?
@Mikhail_Sam The rule of thumb is that you may not want regex unless you really want regex. However, the best thing to do is to test it out.
@Mikhail_Sam I think it should work regardless, will you try it and let me know if it doesn't?
3

If the list of words to remove is big, matching against an ORed regular expression (like "\btest\b|\bsmth\b") can be slow: at each position the regex engine tests the first word, then the second, and so on, which is O(n) in the number of words.

I suggest you use a replacement function using a set for word lookup. Return the word itself if not found, else return nothing to remove the word:

a = {'test', 'smth'}
b = ['test Lorem ipsum dolor sit amet',
     'consectetur adipiscing elit',
     'test Nulla lectus ligula',
     'imperdiet at porttitor quis',
     'smth commodo eget tortor',
     'Orci varius natoque penatibus et magnis dis parturient montes']

import re

result = [re.sub(r"\b(\w+)\b", lambda m : "" if m.group(1) in a else m.group(1),c) for c in b]

print(result)

[' Lorem ipsum dolor sit amet', 'consectetur adipiscing elit', ' Nulla lectus ligula', 'imperdiet at porttitor quis', ' commodo eget tortor', 'Orci varius natoque penatibus et magnis dis parturient montes']

Now, if your list of "words" to replace contains strings composed of 2 words, this method doesn't work, because \w doesn't match spaces. A second pass could be done on the list of "words" made of 2 words:

a = {'lectus ligula', 'porttitor quis'}

and injecting the result in a similar filter but with explicit 2 word match:

result = [re.sub(r"\b(\w+ ?\w+)\b", lambda m : "" if m.group(1) in a else m.group(1),c) for c in result]

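So this takes 2 passes, but if the list of words is huge, it can still be faster than an exhaustive regex. One caveat: re.sub finds non-overlapping matches from left to right, so the second pass consumes words in pairs and can miss a two-word entry that starts on the second word of a pair ('lectus ligula' in ' Nulla lectus ligula' is not removed, because 'Nulla lectus' is matched first).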
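If two-word entries may start at any position in the sentence, one alternative is to scan the split words directly and look up candidates in the set. This is a minimal sketch and not part of the original answer; remove_phrases and its max_words parameter are hypothetical names, and I haven't timed it.

```python
def remove_phrases(sentence, phrases, max_words=2):
    """Drop every word or phrase found in the set `phrases`.

    Hypothetical helper: at each position it tries the longest
    candidate first (max_words words), then shorter ones.
    """
    words = sentence.split()
    kept = []
    i = 0
    while i < len(words):
        for n in range(max_words, 0, -1):
            chunk = words[i:i + n]
            if len(chunk) == n and ' '.join(chunk) in phrases:
                i += n  # skip the matched phrase
                break
        else:
            kept.append(words[i])  # nothing matched here, keep the word
            i += 1
    return ' '.join(kept)


phrases = {'test', 'smth commodo', 'lectus ligula'}
b = ['test Nulla lectus ligula',
     'imperdiet at porttitor quis',
     'smth commodo eget tortor']

print([remove_phrases(s, phrases) for s in b])
# ['Nulla', 'imperdiet at porttitor quis', 'eget tortor']
```

Since it splits and re-joins on whitespace, it also collapses any runs of spaces in the input.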

5 Comments

How to modify it to use for combination of words? I just tried a = {'test', 'smth commodo'} but it fails. As I suppose it's because of \b ... \b in regexp, isn't it?
aaah it's more difficult because you've got words in your list. The low complexity approach doesn't work as easily. Coldspeed answer should work, even if not super-efficient on a big list.
Your question doesn't ask for that. You could edit it (adding the extra condition without modifying the original). My approach would work with a second pass.
Yep it works! Thank you. Just want to test, what approach is faster!
I added speed test for all the solutions. You can check it out in the question now! Thank you
2

This is an alternative way using set, str.join, str.split and str.strip.

a_set = set(a)

b = [[' '.join([word if word not in a_set else ''
                for word in item.split()]).strip()]
     for item in b]

# [['Lorem ipsum dolor sit amet'],
#  ['consectetur adipiscing elit'],
#  ['Nulla lectus ligula'],
#  ['imperdiet at porttitor quis'],
#  ['commodo eget tortor'],
#  ['Orci varius natoque penatibus et magnis dis parturient montes']]
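If a flat list of strings is wanted instead of a list of one-element lists, a variant along the same lines is sketched below. Filtering inside the join (rather than substituting '') also avoids leaving double spaces when a removed word sits in the middle of a sentence. The name flat is only for illustration.

```python
a = ['test', 'smth']
b = ['test Lorem ipsum dolor sit amet',
     'consectetur adipiscing elit',
     'test Nulla lectus ligula',
     'imperdiet at porttitor quis',
     'smth commodo eget tortor',
     'Orci varius natoque penatibus et magnis dis parturient montes']

a_set = set(a)  # set lookup is O(1) per word

# Keep only the words that are not in a_set, then re-join them.
flat = [' '.join(word for word in item.split() if word not in a_set)
        for item in b]

print(flat[0])  # Lorem ipsum dolor sit amet
print(flat[4])  # commodo eget tortor
```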

6 Comments

Interesting solution! But I also need one more step, converting the list of lists to a list of strings, don't I?
@Mikhail_Sam, Yep by no means am I implying this is the most efficient solution. You should test with your data if performance matters. If it doesn't, go with the most readable (probably not this one).
A slight variation of this answer would be [" ".join(filter(lambda x: x not in a, k.split())) for k in b]
@jpp It works perfectly. I added a complication to the question: can you improve your solution to work on combinations of words, please?
@jpp I added speed test for all the solutions. You can check it out in the question now! Look at your result :)
1

A pure functional approach (mostly for educational purposes) is to use partial and reduce from the functools module, along with map, to apply the replacer function to your list of strings.

In [47]: from functools import partial, reduce

In [48]: f = partial(reduce, lambda x, y: x.replace(y + ' ', ''), a)

In [49]: list(map(f, b))
Out[49]: 
['Lorem ipsum dolor sit amet',
 'consectetur adipiscing elit',
 'Nulla lectus ligula',
 'imperdiet at porttitor quis',
 'commodo eget tortor',
 'Orci varius natoque penatibus et magnis dis parturient montes']
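To make the partial/reduce combination easier to follow: partial fixes the first two arguments of reduce, so calling f(line) is reduce(replacer, a, line), with line as the initial value and the words in a folded in one at a time. The sketch below spells that out next to an explicit loop; replacer and f_explicit are illustrative names.

```python
from functools import partial, reduce

a = ['test', 'smth']


def replacer(text, word):
    """Remove `word` followed by a space from `text`."""
    return text.replace(word + ' ', '')


# f(line) is the same as reduce(replacer, a, line).
f = partial(reduce, replacer, a)


def f_explicit(line):
    """The same fold written as a plain loop."""
    for word in a:
        line = replacer(line, word)
    return line


line = 'test smth commodo eget tortor'
print(f(line))           # commodo eget tortor
print(f_explicit(line))  # commodo eget tortor
```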

Also, if the number of items in a is not very large, there's nothing wrong with repeating replace() multiple times. In this case, a fast and straightforward way is to chain two replace calls as follows:

In [54]: [line.replace(a[0] + ' ', '').replace(a[1] + ' ', '') for line in b]
Out[54]: 
['Lorem ipsum dolor sit amet',
 'consectetur adipiscing elit',
 'Nulla lectus ligula',
 'imperdiet at porttitor quis',
 'commodo eget tortor',
 'Orci varius natoque penatibus et magnis dis parturient montes']

2 Comments

Using reduce is an interesting solution! Thank you!
I added speed test for all the solutions. You can check it out in the question now! Thank you
1

You could use map and a regular expression.

import re
a = ['test', 'smth']
b = ['test Lorem ipsum dolor sit amet',
     'consectetur adipiscing elit',
     'test Nulla lectus ligula',
     'imperdiet at porttitor quis',
     'smth commodo eget tortor', 
     'Orci varius natoque penatibus et magnis dis parturient montes']

pattern=r'('+r'|'.join(a)+r')'
b=list(map(lambda x: re.sub(pattern,r'',x).strip(),b))

1 Comment

I added speed test for all the solutions. You can check it out in the question now! Thank you
1

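Another possibility is to join all the word combinations, and then replace \s with | for re.sub. Note that this splits a combination such as 'smth commodo' into the separate alternatives smth and commodo, so each word is removed wherever it appears: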

import re
b = ['test Lorem ipsum dolor sit amet',
 'consectetur adipiscing elit',
 'test Nulla lectus ligula',
 'imperdiet at porttitor quis',
 'smth commodo eget tortor', 
 'Orci varius natoque penatibus et magnis dis parturient montes']
a = ['test', 'smth commodo']
replaced_strings = [re.sub(re.sub(r'\s', '|', ' '.join(a)), '', i) for i in b]

Output:

[' Lorem ipsum dolor sit amet', 'consectetur adipiscing elit', ' Nulla lectus ligula', 'imperdiet at porttitor quis', '  eget tortor', 'Orci varius natoque penatibus et magnis dis parturient montes']

To remove additional whitespace, apply an additional pass:

new_data = [re.sub(r'^\s+', '', i) for i in replaced_strings]

Output:

['Lorem ipsum dolor sit amet', 'consectetur adipiscing elit', 'Nulla lectus ligula', 'imperdiet at porttitor quis', 'eget tortor', 'Orci varius natoque penatibus et magnis dis parturient montes']
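A small sketch of the difference between splitting the combinations on whitespace and keeping each combination as one alternative. The sample text is made up so that 'commodo' also appears on its own, and the variable names are illustrative.

```python
import re

a = ['test', 'smth commodo']
text = 'commodo test smth commodo eget'

# Splitting on whitespace gives 'test|smth|commodo':
# every word is removed independently.
split_pattern = re.sub(r'\s', '|', ' '.join(a))
# Joining the escaped entries keeps 'smth commodo' as one alternative.
phrase_pattern = '|'.join(re.escape(p) for p in a)

# split()/join() only normalises the leftover whitespace for printing.
by_word = ' '.join(re.sub(split_pattern, '', text).split())
by_phrase = ' '.join(re.sub(phrase_pattern, '', text).split())

print(split_pattern)  # test|smth|commodo
print(by_word)        # eget
print(by_phrase)      # commodo eget
```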

1 Comment

I added speed test for all the solutions. You can check it out in the question now! Thank you
0

You may be looking for this:

[el.replace(a[0],'').replace(a[1],'') for el in b]

And if you want to remove spaces as well then use strip()

[el.replace(a[0],'').replace(a[1],'').strip() for el in b]

Hope this helps...

1 Comment

Thank you for the solution! The problem is that the list may be really long (10-20 strings).
