Replace strings using List Comprehensions

Question

Is it possible to do the following using a list comprehension?

a = ['test', 'smth']
b = ['test Lorem ipsum dolor sit amet',
     'consectetur adipiscing elit',
     'test Nulla lectus ligula',
     'imperdiet at porttitor quis',
     'smth commodo eget tortor', 
     'Orci varius natoque penatibus et magnis dis parturient montes']


for s in a:
    b = [el.replace(s,'') for el in b]

What I want is to delete specific words from a list of sentences. I can do it using a loop, but I suppose it is possible with a one-line solution.

I tried something like:

b = [[el.replace(s,'') for el in b] for s in a ]

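but it goes wrong: it builds a separate list for each word in a instead of applying the replacements one after another.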
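A minimal sketch of the difference, using a shortened b. The names nested and chained are only for illustration and are not part of the original code.

```python
a = ['test', 'smth']
b = ['test Lorem ipsum dolor sit amet',
     'smth commodo eget tortor']

# Nested comprehension: one inner list per word in a,
# each with only that single word removed.
nested = [[el.replace(s, '') for el in b] for s in a]
print(len(nested))   # 2
print(nested[0])     # only 'test' removed
print(nested[1])     # only 'smth' removed

# The loop feeds its own result back in, so every word
# ends up removed from every string.
chained = b
for s in a:
    chained = [el.replace(s, '') for el in chained]
print(chained)       # [' Lorem ipsum dolor sit amet', ' commodo eget tortor']
```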

I got a lot of quality answers, but now I have one more complication: what if I want to use a combination of words?

a = ['test', 'smth commodo']

Thank you for all the answers! I ran a speed test on all the solutions and here is the result. Each figure is the mean of 100 runs (except the last column, which takes too long to wait for).

                      b=10 a=2   |  b=9000 a=2 | b=9000 a=100 | b=45k a=500
---------------------------------+-------------+--------------+---------------
COLDSPEED solution:   0.0000206  |  0.0311071  |  0.0943433   |  4.5012770
Jean Fabre solution:  0.0000871  |  0.1722340  |  0.2635452   |  5.2981001
Jpp solution:         0.0000212  |  0.0474531  |  0.0464369   |  0.2450547
Ajax solution:        0.0000334  |  0.0303891  |  0.5262040   | 11.6994496
Daniel solution:      0.0000167  |  0.0162156  |  0.1301132   |  6.9071504
Kasramvd solution:    0.0000120  |  0.0084146  |  0.1704623   |  7.5648351

We can see that Jpp's solution is the fastest, BUT we can't use it: it's the only one of these solutions that can't handle combinations of words (I've already written to him and hope he will improve his answer!). So it looks like @cᴏʟᴅsᴘᴇᴇᴅ's solution is the fastest on the big data sets.
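For anyone who wants to repeat a comparison like this, here is a rough sketch of a timing harness built on timeit. It is not the harness used for the table above; the data sizes, the helper name mean_time and the two candidate functions are assumptions chosen for illustration.

```python
import re
import timeit

a = ['test', 'smth']
b = ['test Lorem ipsum dolor sit amet',
     'smth commodo eget tortor'] * 5   # small, made-up data set

def with_loop():
    # The original loop from the question.
    result = b
    for s in a:
        result = [el.replace(s, '') for el in result]
    return result

# Compile once, outside the timed function.
pattern = re.compile(r'\b({})\b'.format('|'.join(map(re.escape, a))))

def with_regex():
    return [pattern.sub('', text).strip() for text in b]

def mean_time(func, runs=100):
    """Mean wall-clock seconds per call over `runs` calls."""
    return timeit.timeit(func, number=runs) / runs

for name, func in [('loop', with_loop), ('regex', with_regex)]:
    print('{}: {:.7f}'.format(name, mean_time(func)))
```

Absolute numbers depend on the machine and the Python version, so only comparisons made within one run are meaningful.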

And why would you want to rewrite perfectly good code into a one-liner that's more difficult to read?
– Aran-Fey, Commented Apr 23, 2018 at 8:31
@Aran-Fey I'm not sure :) I thought it more pythonic and maybe a little faster.
– Mikhail_Sam, Commented Apr 23, 2018 at 8:33

cs95 · Accepted Answer · 2018-04-23 08:44:55Z

5

There's nothing wrong with what you have, but if you want to clean things up a bit and performance isn't important, then compile a regex pattern and call sub inside a loop.

>>> import re
>>> p = re.compile(r'\b({})\b'.format('|'.join(a)))
>>> [p.sub('', text).strip() for text in b]

['Lorem ipsum dolor sit amet',
 'consectetur adipiscing elit',
 'Nulla lectus ligula',
 'imperdiet at porttitor quis',
 'commodo eget tortor',
 'Orci varius natoque penatibus et magnis dis parturient montes'
]

Details
Your pattern will look something like this:

\b    # word-boundary - remove if you also want to replace substrings
(
test  # word 1
|     # regex OR pipe
smth  # word 2 ... you get the picture
)
\b    # end with another word boundary - again, remove for substr replacement

And this is the compiled regex pattern matcher:

>>> p
re.compile(r'\b(test|smth)\b', re.UNICODE)
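Because each alternative is plain text, the same pattern also accepts multi-word entries such as 'smth commodo'. One thing to watch, sketched below under the assumption that one entry is a prefix of another: the regex engine tries alternatives left to right, so sorting the longest entries first keeps the phrase from being cut short. The word list here is hypothetical.

```python
import re

a = ['smth', 'smth commodo']   # hypothetical list: one entry is a prefix of the other
text = 'smth commodo eget tortor'

# Original order: 'smth' is tried first and wins, so 'commodo' survives.
p_unsorted = re.compile(r'\b({})\b'.format('|'.join(a)))
print(p_unsorted.sub('', text).strip())   # commodo eget tortor

# Longest first: the two-word phrase is tried before its prefix.
longest_first = sorted(a, key=len, reverse=True)
p_sorted = re.compile(r'\b({})\b'.format('|'.join(longest_first)))
print(p_sorted.sub('', text).strip())     # eget tortor
```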

Another consideration is whether the words you are removing contain characters that the regex engine would interpret specially instead of treating them as literals. These are regex metacharacters, and you can escape them with re.escape while building your pattern.

p = re.compile(r'\b({})\b'.format(
    '|'.join([re.escape(word) for word in a]))
)
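A small illustration of what escaping changes. The word 'v1.0' and the sample text are made up for this sketch; the point is that an unescaped '.' matches any character.

```python
import re

a = ['v1.0']   # hypothetical word containing the metacharacter '.'
text = 'v1.0 shipped, v1x0 did not'

unescaped = re.compile(r'\b({})\b'.format('|'.join(a)))
escaped = re.compile(r'\b({})\b'.format(
    '|'.join(re.escape(word) for word in a)))

# Without escaping, '.' matches any character, so 'v1x0' is removed too.
print(repr(unescaped.sub('', text)))   # ' shipped,  did not'

# With escaping, only the literal 'v1.0' is removed.
print(repr(escaped.sub('', text)))     # ' shipped, v1x0 did not'
```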

Of course, keep in mind that with larger data and more replacements, regex and string replacements both become slow. Consider something better suited to large-scale replacement, like flashtext.

edited Apr 23, 2018 at 8:44

answered Apr 23, 2018 at 8:32

cs95


7 Comments

Aran-Fey Over a year ago

You should probably change that join(a) to join(re.escape(word) for word in a) to be on the safe side in case the input contains any special regex characters

cs95 Over a year ago

@Aran-Fey I thought about it... decided it would only serve to confuse. But you're right, will amend in the forthcoming iteration.

Mikhail_Sam Over a year ago

@cᴏʟᴅsᴘᴇᴇᴅ Thank you again for such a fast answer! As I understand it, using re is slower than the loop. Am I right?

cs95 Over a year ago

@Mikhail_Sam The rule of thumb is that you may not want regex unless you really want regex. However, the best thing to do is to test it out.

cs95 Over a year ago

@Mikhail_Sam I think it should work regardless, will you try it and let me know if it doesn't?

Jean-François Fabre · Accepted Answer · 2018-04-23 10:37:18Z

3

If the list of words to remove is big, matching against an ORed regular expression (like "\btest\b|\bsmth\b") can be slow, because it is O(n) in the number of words: the regex engine tests the first word, then the second, and so on.

I suggest you use a replacement function using a set for word lookup. Return the word itself if not found, else return nothing to remove the word:

a = {'test', 'smth'}
b = ['test Lorem ipsum dolor sit amet',
     'consectetur adipiscing elit',
     'test Nulla lectus ligula',
     'imperdiet at porttitor quis',
     'smth commodo eget tortor',
     'Orci varius natoque penatibus et magnis dis parturient montes']

import re

result = [re.sub(r"\b(\w+)\b", lambda m : "" if m.group(1) in a else m.group(1),c) for c in b]

print(result)

[' Lorem ipsum dolor sit amet', 'consectetur adipiscing elit', ' Nulla lectus ligula', 'imperdiet at porttitor quis', ' commodo eget tortor', 'Orci varius natoque penatibus et magnis dis parturient montes']

Now if your list of "words" to replace contains strings composed of 2 words, this method doesn't work, because \w doesn't match spaces. A second pass could be done on the list of "words" made of 2 words:

a = {'lectus ligula', 'porttitor quis'}

and feeding the result into a similar filter, this time with an explicit 2-word match:

result = [re.sub(r"\b(\w+ ?\w+)\b", lambda m : "" if m.group(1) in a else m.group(1),c) for c in result]

That makes 2 passes, but if the list of words is huge, it's still faster than an exhaustive regex.
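One caveat with the two-word pattern: re.sub consumes words in non-overlapping pairs from the left, so a phrase is only found when it happens to line up with those pairs. In ' Nulla lectus ligula', for example, the pairs are 'Nulla lectus' and then 'ligula', so 'lectus ligula' is never looked up. Below is a sketch of an alternative that keeps the set lookup but slides over the words one at a time. The function name remove_phrases and the max_len parameter are hypothetical, and the sketch splits on whitespace only, so punctuation stays attached to words.

```python
def remove_phrases(sentence, phrases, max_len=2):
    """Drop every run of up to max_len words that appears in phrases."""
    words = sentence.split()
    kept = []
    i = 0
    while i < len(words):
        # Try the longest candidate first so a phrase wins over its first word.
        for n in range(max_len, 0, -1):
            if i + n <= len(words) and ' '.join(words[i:i + n]) in phrases:
                i += n          # skip the matched words
                break
        else:
            kept.append(words[i])   # nothing matched at this position
            i += 1
    return ' '.join(kept)

phrases = {'test', 'lectus ligula', 'porttitor quis'}
b = ['test Nulla lectus ligula',
     'imperdiet at porttitor quis',
     'consectetur adipiscing elit']

print([remove_phrases(c, phrases) for c in b])
# ['Nulla', 'imperdiet at', 'consectetur adipiscing elit']
```

Each position costs at most max_len set lookups, so the running time still does not grow with the number of phrases.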

edited Apr 23, 2018 at 10:37

answered Apr 23, 2018 at 8:38

Jean-François Fabre♦

5 Comments

Mikhail_Sam Over a year ago

How to modify it to use for combination of words? I just tried a = {'test', 'smth commodo'} but it fails. As I suppose it's because of \b ... \b in regexp, isn't it?

Jean-François Fabre Over a year ago

aaah it's more difficult because you've got words in your list. The low complexity approach doesn't work as easily. Coldspeed answer should work, even if not super-efficient on a big list.

Jean-François Fabre Over a year ago

Your question doesn't ask for that. You could edit it (adding the extra condition without modifying the original). My approach would work with a second pass.

Mikhail_Sam Over a year ago

Yep it works! Thank you. Just want to test, what approach is faster!

Mikhail_Sam Over a year ago

I added speed test for all the solutions. You can check it out in the question now! Thank you

jpp · Accepted Answer · 2018-04-23 08:35:31Z

2

This is an alternative way using set, str.join, str.split and str.strip.

a_set = set(a)

b = [[' '.join([word if word not in a_set else ''
                for word in item.split()]).strip()]
     for item in b]

# [['Lorem ipsum dolor sit amet'],
#  ['consectetur adipiscing elit'],
#  ['Nulla lectus ligula'],
#  ['imperdiet at porttitor quis'],
#  ['commodo eget tortor'],
#  ['Orci varius natoque penatibus et magnis dis parturient montes']]
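If a flat list of strings is wanted instead of a list of one-element lists, a variant along these lines should do it. This is a sketch that drops the inner brackets and filters words out instead of substituting empty strings; the name cleaned is an assumption, and like the original it only handles single words.

```python
a = ['test', 'smth']
b = ['test Lorem ipsum dolor sit amet',
     'consectetur adipiscing elit',
     'smth commodo eget tortor']

a_set = set(a)   # set membership tests are O(1) on average

# Keep only the words that are not in a_set, then rejoin each sentence.
cleaned = [' '.join(word for word in item.split() if word not in a_set)
           for item in b]

print(cleaned)
# ['Lorem ipsum dolor sit amet', 'consectetur adipiscing elit', 'commodo eget tortor']
```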

answered Apr 23, 2018 at 8:35

jpp

6 Comments

Mikhail_Sam Over a year ago

Interesting solution! But I'd also need one more step, converting the list of lists to a list of strings, wouldn't I?

jpp Over a year ago

@Mikhail_Sam, Yep by no means am I implying this is the most efficient solution. You should test with your data if performance matters. If it doesn't, go with the most readable (probably not this one).

Sohaib Farooqi Over a year ago

A slight variation of this answer would be [" ".join(filter(lambda x: x not in a, k.split())) for k in b]

Mikhail_Sam Over a year ago

@jpp It works perfectly. I add some complication for the question - can you improve your solution to work on combination of words please?

Mikhail_Sam Over a year ago

@jpp I added speed test for all the solutions. You can check it out in the question now! Look at your result :)

Kasravnd · Accepted Answer · 2018-04-23 08:57:35Z

1

A purely functional approach (mostly for educational purposes) is to use the partial and reduce functions from the functools module, along with map, to apply the replacer function to your list of strings.

In [48]: f = partial(reduce, lambda x, y: x.replace(y + ' ', ''), a)

In [49]: list(map(f, b))
Out[49]: 
['Lorem ipsum dolor sit amet',
 'consectetur adipiscing elit',
 'Nulla lectus ligula',
 'imperdiet at porttitor quis',
 'commodo eget tortor',
 'Orci varius natoque penatibus et magnis dis parturient montes']

Also, if the number of items in a is not very large, there's nothing wrong with repeating replace() multiple times. In this case, a fast and straightforward way is to chain two replace calls as follows:

In [54]: [line.replace(a[0] + ' ', '').replace(a[1] + ' ', '') for line in b]
Out[54]: 
['Lorem ipsum dolor sit amet',
 'consectetur adipiscing elit',
 'Nulla lectus ligula',
 'imperdiet at porttitor quis',
 'commodo eget tortor',
 'Orci varius natoque penatibus et magnis dis parturient montes']
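For reference, a self-contained version of the reduce approach with the imports the session above leaves out. The name remove_all and the extra sentence 'contest results' are assumptions added to show one side effect: str.replace works on substrings, so it also trims a word that merely ends in 'test'.

```python
from functools import partial, reduce

a = ['test', 'smth']
b = ['test Lorem ipsum dolor sit amet',
     'smth commodo eget tortor',
     'contest results']   # made-up sentence to show the substring effect

# partial(reduce, func, a)(line) calls reduce(func, a, line):
# it starts from `line` and applies one replace per word in a.
remove_all = partial(reduce,
                     lambda text, word: text.replace(word + ' ', ''),
                     a)

print(list(map(remove_all, b)))
# ['Lorem ipsum dolor sit amet', 'commodo eget tortor', 'conresults']
```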

answered Apr 23, 2018 at 8:57

Kasravnd

2 Comments

Mikhail_Sam Over a year ago

Using reduce is an interesting solution! Thank you!

Mikhail_Sam Over a year ago

I added speed test for all the solutions. You can check it out in the question now! Thank you

Daniel · Accepted Answer · 2018-04-23 09:01:27Z

1

You could use map and a regular expression.

import re
a = ['test', 'smth']
b = ['test Lorem ipsum dolor sit amet',
     'consectetur adipiscing elit',
     'test Nulla lectus ligula',
     'imperdiet at porttitor quis',
     'smth commodo eget tortor', 
     'Orci varius natoque penatibus et magnis dis parturient montes']

pattern=r'('+r'|'.join(a)+r')'
b=list(map(lambda x: re.sub(pattern,r'',x).strip(),b))

edited Apr 23, 2018 at 9:01

answered Apr 23, 2018 at 8:56

Daniel

1 Comment

Mikhail_Sam Over a year ago

I added speed test for all the solutions. You can check it out in the question now! Thank you

Ajax1234 · Accepted Answer · 2018-04-23 10:55:51Z

1

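Another possibility is to join all the entries with spaces and then replace each whitespace character (\s) with | to build the pattern for re.sub. Note that this also splits a multi-word entry such as 'smth commodo' into separate words: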

import re
b = ['test Lorem ipsum dolor sit amet',
 'consectetur adipiscing elit',
 'test Nulla lectus ligula',
 'imperdiet at porttitor quis',
 'smth commodo eget tortor', 
 'Orci varius natoque penatibus et magnis dis parturient montes']
a = ['test', 'smth commodo']
replaced_strings = [re.sub(re.sub(r'\s', '|', ' '.join(a)), '', i) for i in b]

Output:

[' Lorem ipsum dolor sit amet', 'consectetur adipiscing elit', ' Nulla lectus ligula', 'imperdiet at porttitor quis', '  eget tortor', 'Orci varius natoque penatibus et magnis dis parturient montes']

To remove additional whitespace, apply an additional pass:

new_data = [re.sub(r'^\s+', '', i) for i in replaced_strings]

Output:

['Lorem ipsum dolor sit amet', 'consectetur adipiscing elit', 'Nulla lectus ligula', 'imperdiet at porttitor quis', 'eget tortor', 'Orci varius natoque penatibus et magnis dis parturient montes']
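Since the whitespace replacement turns 'smth commodo' into the two alternatives 'smth' and 'commodo', each of those words is removed wherever it appears on its own. If the phrase should only be removed as a whole, one assumed variant is to join the escaped entries directly with |. The sample text below is made up to show the difference.

```python
import re

a = ['test', 'smth commodo']
text = 'commodo eget tortor'   # contains 'commodo' but not the full phrase

# Approach from the answer: whitespace becomes '|', splitting the phrase.
split_pattern = re.sub(r'\s', '|', ' '.join(a))
print(split_pattern)                            # test|smth|commodo
print(repr(re.sub(split_pattern, '', text)))    # ' eget tortor'

# Assumed variant: join the escaped entries directly, keeping phrases whole.
phrase_pattern = '|'.join(re.escape(entry) for entry in a)
print(repr(re.sub(phrase_pattern, '', text)))   # 'commodo eget tortor'
```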

answered Apr 23, 2018 at 10:55

Ajax1234

1 Comment

Mikhail_Sam Over a year ago

I added speed test for all the solutions. You can check it out in the question now! Thank you

Abdul Quddus · Accepted Answer · 2018-04-23 09:50:43Z

0

You may be looking for this:

[el.replace(a[0],'').replace(a[1],'') for el in b]

And if you want to remove spaces as well then use strip()

[el.replace(a[0],'').replace(a[1],'').strip() for el in b]

Hope this helps...

answered Apr 23, 2018 at 9:50

Abdul Quddus

1 Comment

Mikhail_Sam Over a year ago

Thank you for solution! The problem is a list may be really long (10-20 strings).
