
I need help with a program I'm making in Python.

Assume I wanted to replace every instance of the word "steak" with "ghost" (just go with it...) while also replacing every instance of the word "ghost" with "steak" at the same time. The following code does not work:

s = "The scary ghost ordered an expensive steak"
print s
s = s.replace("steak", "ghost")
s = s.replace("ghost", "steak")
print s

it prints: "The scary steak ordered an expensive steak"

What I'm trying to get is: "The scary steak ordered an expensive ghost"

  • Do you want unghosted to become unsteaked? (The example that comes to mind from a question the other week was "name" and "enamel".) Commented Mar 10, 2013 at 16:16
  • Hm, but maybe you want ghosts to be converted to steaks and vice-versa. It's what .replace would do. Commented Mar 10, 2013 at 16:18
  • Do you want to replace only once? Or every instance? You should clarify that portion of your question ... (especially considering the attention it is getting) Commented Mar 10, 2013 at 18:26

6 Answers


I'd probably use a regex here:

>>> import re
>>> s = "The scary ghost ordered an expensive steak"
>>> sub_dict = {'ghost':'steak','steak':'ghost'}
>>> regex = '|'.join(sub_dict)
>>> re.sub(regex, lambda m: sub_dict[m.group()], s)
'The scary steak ordered an expensive ghost'

Or, as a function which you can copy/paste:

import re
def word_replace(replace_dict,s):
    regex = '|'.join(replace_dict)
    return re.sub(regex, lambda m: replace_dict[m.group()], s)

Basically, I create a mapping of words that I want to replace with other words (sub_dict). I can create a regular expression from that mapping. In this case, the regular expression is "steak|ghost" (or "ghost|steak" -- order doesn't matter) and the regex engine does the rest of the work of finding non-overlapping sequences and replacing them accordingly.


Some possibly useful modifications

  • regex = '|'.join(map(re.escape,replace_dict)) -- Allows the regular expressions to have special regular expression syntax in them (like parenthesis). This escapes the special characters to make the regular expressions match the literal text.
  • regex = '|'.join(r'\b{0}\b'.format(x) for x in replace_dict) -- make sure that we don't match if one of our words is a substring in another word. In other words, change he to she but not the to tshe.
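Putting both modifications together (my own sketch in Python 3 syntax; word_replace_safe is a hypothetical name, not from the answer), escaping handles regex metacharacters in the keys while the \b anchors restrict matches to whole words:

```python
import re

def word_replace_safe(replace_dict, s):
    # Escape each key so regex metacharacters are treated literally,
    # then wrap it in \b anchors so only whole words match.
    regex = '|'.join(r'\b{}\b'.format(re.escape(k)) for k in replace_dict)
    return re.sub(regex, lambda m: replace_dict[m.group()], s)

# "the" is left alone even though it contains "he"
print(word_replace_safe({'he': 'she', 'she': 'he'}, "he said she said, the end"))
# → she said he said, the end
```

Note that \b only works cleanly here because the keys start and end with word characters; keys made entirely of punctuation would need different anchoring.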

10 Comments

One advantage of using regexes is that you could add boundary markers to help prevent accidental substring matches. [You could make that work without regexes too, of course.]
@DSM -- Sure -- And really that just becomes a question of modifying how you join the sub_dict here. Something like: '|'.join(r'(?:\b{}\b)'.format(x) for x in sub_dict) -- if I'm remembering my regex syntax properly :)
(Oh, and close the quote. :P)
I suppose that's why I usually test these things in the interactive interpreter.
Good idea to escape the strings so it doesn't break when someone adds a string with special meaning: regex = '|'.join(re.escape(k) for k in sub_dict)

Split the string by one of the targets, do the replace, and put the whole thing back together.

pieces = s.split('steak')
s = 'ghost'.join(piece.replace('ghost', 'steak') for piece in pieces)

This works exactly as .replace() would, including ignoring word boundaries. So it will turn "steak ghosts" into "ghost steaks".
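For illustration (my own sketch in Python 3 syntax; swap_words is a hypothetical name), the same split/join trick generalized to any pair of words, including the substring behavior described above:

```python
def swap_words(s, a, b):
    # Split on one target, swap the other inside each piece,
    # then stitch the pieces back together with the first target's replacement.
    pieces = s.split(a)
    return b.join(piece.replace(b, a) for piece in pieces)

print(swap_words("The scary ghost ordered an expensive steak", "steak", "ghost"))
# → The scary steak ordered an expensive ghost
print(swap_words("steak ghosts", "steak", "ghost"))
# → ghost steaks   (substrings are swapped too, just like .replace())
```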

10 Comments

+1: This has better performance than the regex solution. See my deleted answer
@Abhijit -- Interesting, but not too surprising that this would be faster than re.sub. I still like re.sub though ... I find it to be a little more explicit and flexible (e.g. adding word boundaries). Note that if you really want this to go faster, you could put the stuff in join in a list-comp instead of a generator expression. That is always faster for .join ... but stylistically it's uglier.
@nneonneo -- I guess I should read the source on ''.join for python3. On python2, the generator gets turned into a list anyway which is why it's always faster. But maybe they did something different on python3.x?
@mgilson: OOPS. Misread the timeit results. I guess DST really messed me up. .join is always 10~15% slower if given a genexpr than if it's given a list.
@nneonneo -- you threw me for a loop for a minute there :). join needs a list so that it can iterate through it twice. The first iteration figures out what size the output string should be so that it can allocate enough memory.

Rename one of the words to a temp value that doesn't occur in the text. Note this wouldn't be the most efficient way for a very large text. For that a re.sub might be more appropriate.

s = "The scary ghost ordered an expensive steak"
print s
s = s.replace("steak", "temp")
s = s.replace("ghost", "steak")
s = s.replace("temp", "ghost")
print s
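If you'd rather not hand-pick a temp value at all, one variation (my own sketch, in Python 3 syntax rather than this answer's Python 2) is to generate the sentinel with the uuid module, which makes a collision with user input vanishingly unlikely:

```python
from uuid import uuid4

s = "The scary ghost ordered an expensive steak"
temp = str(uuid4())  # random sentinel, vanishingly unlikely to occur in any input
s = s.replace("steak", temp).replace("ghost", "steak").replace(temp, "ghost")
print(s)
# → The scary steak ordered an expensive ghost
```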

8 Comments

But say I wanted to have a user input for it, is there anyway to ensure they don't use the temp value WITHOUT making the temp value a random series of letters like "zxdasd"?
@user2154113 you can first check if the value is in the string. Anyway, this is the wrong way of doing it. The right way is Leif Andersen's.
And how does Leif's answer replace every instance if there is more than one instance? This way actually works.
This works but requires shifty temporary values. (You can use some random Unicode to reduce the risk, but it is not a very general solution).
@MarkTolonen: Yeah, but generally you wish that the temp value were something that could not possibly exist in the string (i.e. it should be out-of-band).

Use the count argument of the str.replace() method. So using your code, you would have:

s="The scary ghost ordered an expensive steak"
print s
s=s.replace("steak","ghost", 1)
s=s.replace("ghost","steak", 1)
print s

http://docs.python.org/2/library/stdtypes.html

4 Comments

This would work, but only for this specific example. If there were more instances of the words or even if the words were in a different order it wouldn't work.
True, but since we weren't given any other context for why the words needed to be changed, this is the best way to do it. But yes, if he did want a more robust solution, then this won't work.
+1 for being the only "replace once". All other answers solve the general case of swapping X,Y in a string, but yours answers the question.
@RyanAmos -- "replace once" is in the title, but if you read the body of the question it says "every instance" ...

How about something like this? Store the original in a split list, then have a translation dict. Keeps your core code short, then just adjust the dict when you need to adjust the translation. Plus, easy to port to a function:

def translate_line(s, translation_dict):
    line = []
    for i in s.split():
        # To account for punctuation, strip all non-alphanumeric characters
        # from the word before looking up the translation.
        i = ''.join(ch for ch in i if ch.isalnum())
        line.append(translation_dict.get(i, i))
    return ' '.join(line)


>>> translate_line("The scary ghost ordered an expensive steak", {'steak': 'ghost', 'ghost': 'steak'})
'The scary steak ordered an expensive ghost'

2 Comments

+1 for the idea, but you're not quite there yet. This only works if there is no punctuation.
@TimPietzcker Yah, definitely not a robust solution :) You could probably strip all non-alphanumeric characters from the word and then use that for translation .. I'll adjust

Note: Considering the viewership of this question, I undeleted this answer and rewrote it for different types of test cases.

I have considered four competing implementations from the answers:

>>> import re
>>> from uuid import uuid4
>>> from timeit import Timer
>>> sub_dict = {'ghost': 'steak', 'steak': 'ghost'}
>>> regex = '|'.join(sub_dict)
>>> def sub_noregex(hay):
    """
    The join-and-replace routine which outperforms the regex implementation. This
    version uses a generator expression
    """
    return 'steak'.join(e.replace('steak','ghost') for e in hay.split('ghost'))

>>> def sub_regex(hay):
    """
    This is a straightforward regex implementation as suggested by @mgilson.
    Note, so that the overhead doesn't add to the cumulative sum, I have placed
    the regex creation routine outside the function
    """
    return re.sub(regex,lambda m:sub_dict[m.group()],hay)

>>> def sub_temp(hay, _uuid = str(uuid4())):
    """
    Similar to Mark Tolonen's implementation, but uses a uuid for the temporary
    string value to reduce the risk of collision
    """
    hay = hay.replace("steak",_uuid).replace("ghost","steak").replace(_uuid,"ghost")
    return hay

>>> def sub_noregex_LC(hay):
    """
    The join-and-replace routine which outperforms the regex implementation. This
    version uses a list comprehension
    """
    return 'steak'.join([e.replace('steak','ghost') for e in hay.split('ghost')])

A generalized timeit function

>>> def compare(n, hay):
    import __main__
    __main__.hay = hay  # make hay importable in the Timer setup below
    foo = [("sub_temp", ""),
           ("sub_noregex", ""),
           ("sub_regex", "re"),
           ("sub_noregex_LC", ""),
           ]  # ordered to match the column order of the table printed by test()
    stmt = "{}(hay)"
    setup = "from __main__ import hay,"
    for k, v in foo:
        t = Timer(stmt = stmt.format(k), setup = setup + ','.join([k, v] if v else [k]))
        yield t.timeit(n)

And the generalized test routine

>>> def test(*args, **kwargs):
    n = kwargs['repeat']
    print "{:50}{:^15}{:^15}{:^15}{:^15}".format("Test Case", "sub_temp",
                             "sub_noregex ", "sub_regex",
                             "sub_noregex_LC ")
    for hay in args:
        hay, hay_str = hay
        print "{:50}{:15.10}{:15.10}{:15.10}{:15.10}".format(hay_str, *compare(n, hay))

And the test results are as follows:

>>> test((' '.join(['steak', 'ghost']*1000), "Multiple repetitions of the search keys"),
         ('garbage '*998 + 'steak ghost', "Single occurrence of the search keys at the end"),
         ('steak ' + 'garbage '*998 + 'ghost', "Single occurrence at either end"),
         ("The scary ghost ordered an expensive steak", "Single occurrence in a short string"),
         repeat = 100000)
Test Case                                            sub_temp     sub_noregex      sub_regex   sub_noregex_LC 
Multiple repetitions of the search keys              0.2022748797   0.3517142003   0.4518992298   0.1812594258
Single occurrence of the search keys at the end      0.2026047957   0.3508259952   0.4399926194   0.1915298898
Single occurrence at either end                      0.1877455356   0.3561734007   0.4228843986   0.2164233388
Single occurrence in a short string                  0.2061019057   0.3145984487   0.4252060592   0.1989413449
>>> 
>>> 

Based on the test results:

  1. The non-regex list-comprehension version and the temp-variable substitution have the best performance, though the temp-variable timings are not consistent

  2. The list-comprehension version has better performance than the generator version (confirmed)

  3. Regex is more than two times slower (so if this piece of code is a bottleneck, the implementation choice can be reconsidered)

  4. The regex and non-regex versions are equally robust and can scale
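As a follow-up to point 3, part of the regex cost can be trimmed by compiling the pattern once and reusing it (a sketch in Python 3 syntax along the lines of mgilson's answer; sub_regex_compiled is a hypothetical name, and whether this closes the performance gap would need to be re-measured):

```python
import re

sub_dict = {'ghost': 'steak', 'steak': 'ghost'}
# Compile the alternation once, escaping keys in case they contain metacharacters.
pattern = re.compile('|'.join(map(re.escape, sub_dict)))

def sub_regex_compiled(hay):
    # Reuses the precompiled pattern instead of recompiling on every call.
    return pattern.sub(lambda m: sub_dict[m.group()], hay)

print(sub_regex_compiled("The scary ghost ordered an expensive steak"))
# → The scary steak ordered an expensive ghost
```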

1 Comment

Hmm. I have to admit that I’m puzzled by that. All the non-regex versions perform many more memory allocations. This should be the performance-limiting step. Have you tried pre-compiling the regular expression? Do you know which regular expression engine Python is using? (I’m also appalled that list comprehension is (so much!) faster than generator expressions. Looks like a serious weakness in CPython.)
