Replace previous occurrence of string

Question

I want to remove duplicated words inside brackets and replace them with "S" + word.

For example:

(Skipper Skipper) -> (S Skipper)
('s 's) -> (S 's)

Here is the string, s:

s = "(S (S (S (S (Skipper Skipper) ('s 's)) (Inc. Inc.)) (S (Bellevue Bellevue) (Wash. Wash.))) 
     (S (said said) (S (it it) (S (signed signed) (S (a a) (S (definitive definitive) 
     (S (merger merger) (S (agreement agreement) (S (for for) (S (S (a a) 
     (S (National National) (S (Pizza Pizza) (S (Corp. Corp.) (unit unit))))) 
     (S (to to) (S (acquire acquire) (S (S (S (the the) (S (90.6 90.6) (% %))) 
     (S (S (of of) (S (S (Skipper Skipper) ('s 's)) (Inc. Inc.))) (S (it it) 
     (S (does does) (S (n't n't) (own own)))))) (S (for for) (S (S (11.50 11.50) 
     (S (a a) (share share))) (S (or or) (S (about about) (S (28.1 28.1) (million million)))))))))))))))))))"

Expected result:

out = "(S (S (S (S (S Skipper) (S 's)) (S Inc.)) (S (S Bellevue) (S Wash.))) 
       (S (S said) (S (S it) (S (S signed) (S (S a) (S (S definitive) (S (S merger) 
       (S (S agreement) (S (S for) (S (S (S a) (S (S National) (S (S Pizza) (S (S Corp.) 
       (S unit))))) (S (S to) (S (S acquire) (S (S (S (S the) (S (S 90.6) (S %))) (S (S (S of) 
       (S (S (S Skipper) (S 's)) (S Inc.))) (S (S it) (S (S does) (S (S n't) (S own)))))) 
       (S (S for) (S (S (S 11.50) (S (S a) (S share))) (S (S or) (S (S about) (S (S 28.1) (S million)))))))))))))))))))"

I tried this:

from collections import Counter

lst = s.lstrip("(").rstrip(")").replace("(", "").replace(")", "").split()
d = Counter(lst)
mapper = {((k + " ") * v).strip():"S" + " " + k for k, v in d.items()}
for k, v in mapper.items():
    out = s.replace(k, v)

But the result is not quite right:

out = "(S (S (S (S (Skipper Skipper) ('s 's)) (Inc. Inc.)) (S (S Bellevue) (S Wash.))) 
       (S (S said) (S (it it) (S (S signed) (S (a a) (S (S definitive) (S (S merger) 
       (S (S agreement) (S (for for) (S (S (a a) (S (S National) (S (S Pizza) (S (S Corp.) 
       (S unit))))) (S (S to) (S (S acquire) (S (S (S (S the) (S (S 90.6) (S %))) (S (S (S of) 
       (S (S (Skipper Skipper) ('s 's)) (Inc. Inc.))) (S (it it) (S (S does) (S (S n't) (S own)))))) 
       (S (for for) (S (S (S 11.50) (S (a a) (S share))) (S (S or) (S (S about) (S (S 28.1) (S million)))))))))))))))))))"

There are only duplicates inside the brackets; is it always like this? – Alexander Riedel, commented Nov 30, 2020 at 8:44

alex_noname · Accepted Answer · 2020-11-30 09:47:26Z


You can use re.sub with a backreference in the regular expression.

To find a duplicated word, use \1 in the pattern, which refers to the text captured by the first group, and \g<1> to refer to the same group in the repl argument, like so:

import re
res = re.sub(r"([\w.'%]+)\s+\1", r"S \g<1>", s)
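One caveat with the unanchored pattern: it also matches when the second word merely starts with the first, so "(a an)" would become "(S an)". That case does not occur in the sample string, but a stricter variant can anchor on the enclosing parentheses. The following is a minimal sketch under that assumption; the names PAIR and collapse_pairs are hypothetical and not part of the original answer.

```python
import re

# Hypothetical names for illustration. The pattern requires the opening and
# closing parentheses, so only a complete "(word word)" pair is rewritten.
PAIR = re.compile(r"\(([^\s()]+)\s+\1\)")

def collapse_pairs(text):
    # "(S \1)" re-emits the parentheses and keeps one copy of the word.
    return PAIR.sub(r"(S \1)", text)

sample = "(S (S (Skipper Skipper) ('s 's)) (Inc. Inc.))"
print(collapse_pairs(sample))    # pairs of identical words are collapsed
print(collapse_pairs("(a an)"))  # different words are left unchanged
```

Because the character class is "anything except whitespace and parentheses", this variant does not need to list every punctuation character that may appear in a token.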

edited Nov 30, 2020 at 9:47

answered Nov 30, 2020 at 9:21


AnsFourtyTwo · Accepted Answer · 2020-11-30 08:53:48Z

You might want to look into regular expressions here. I've created a demo which will match all inner brackets.

Having those, you can analyze the content of each match and replace it according to your requirements:

import re

s = "(S (S (S (S (Skipper Skipper) ('s 's)) (Inc. Inc.)) (S (Bellevue Bellevue) (Wash. Wash.))) \
     (S (said said) (S (it it) (S (signed signed) (S (a a) (S (definitive definitive) \
     (S (merger merger) (S (agreement agreement) (S (for for) (S (S (a a) \
     (S (National National) (S (Pizza Pizza) (S (Corp. Corp.) (unit unit))))) \
     (S (to to) (S (acquire acquire) (S (S (S (the the) (S (90.6 90.6) (% %))) \
     (S (S (of of) (S (S (Skipper Skipper) ('s 's)) (Inc. Inc.))) (S (it it) \
     (S (does does) (S (n't n't) (own own)))))) (S (for for) (S (S (11.50 11.50) \
     (S (a a) (share share))) (S (or or) (S (about about) (S (28.1 28.1) (million million)))))))))))))))))))"

# Finding all inner brackets:
# - (Skipper Skipper)
# - ('s 's)
# - etc.
double_words = re.findall(r"(\((?:\(??[^\(]*?\)))", s)


for double_word in double_words:
    words = double_word.lstrip("(").rstrip(")").split()
    # First and second word are the same
    if words[0]==words[1]:
        # Replace ('s 's) with (S 's)
        s = s.replace(double_word, f'(S {words[0]})')
        
print(s)

Output

(S (S (S (S (S Skipper) (S 's)) (S Inc.)) (S (S Bellevue) (S Wash.)))      (S (S said) (S (S it) (S (S signed) (S (S a) (S (S definitive)      (S (S merger) (S (S agreement) (S (S for) (S (S (S a)      (S (S National) (S (S Pizza) (S (S Corp.) (S unit)))))      (S (S to) (S (S acquire) (S (S (S (S the) (S (S 90.6) (S %)))      (S (S (S of) (S (S (S Skipper) (S 's)) (S Inc.))) (S (S it)      (S (S does) (S (S n't) (S own)))))) (S (S for) (S (S (S 11.50)      (S (S a) (S share))) (S (S or) (S (S about) (S (S 28.1) (S million)))))))))))))))))))
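The same idea can be sketched on a shortened input with a simpler pattern for the innermost brackets, assuming an innermost bracket is one that contains no further parentheses. The variable names and the shortened sample are illustrative only; the length check is an added guard for brackets that do not hold exactly two words.

```python
import re

sample = "(S (S (Skipper Skipper) ('s 's)) (Inc. Inc.))"

# An innermost bracket contains no further parentheses.
inner = re.findall(r"\([^()]*\)", sample)

result = sample
for group in set(inner):  # handle each distinct bracket only once
    words = group[1:-1].split()
    # Guard against brackets that do not hold exactly two words.
    if len(words) == 2 and words[0] == words[1]:
        result = result.replace(group, f"(S {words[0]})")

print(inner)
print(result)
```

Since str.replace substitutes every occurrence, iterating over the distinct matches is enough even when the same pair appears several times.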

jizhihaoSAMA · Accepted Answer · 2020-11-30 08:56:04Z

Use re.sub to replace them:

import re

def sub(matched):
    # Rewrite "(word word)" as "(S word)"; leave any other pair unchanged.
    if matched.group(1) == matched.group(2):
        return f"(S {matched.group(2)})"
    return matched.group(0)

s = '''(S (S (S (S (Skipper Skipper) ('s 's)) (Inc. Inc.)) (S (Bellevue Bellevue) (Wash. Wash.))) 
     (S (said said) (S (it it) (S (signed signed) (S (a a) (S (definitive definitive) 
     (S (merger merger) (S (agreement agreement) (S (for for) (S (S (a a) 
     (S (National National) (S (Pizza Pizza) (S (Corp. Corp.) (unit unit))))) 
     (S (to to) (S (acquire acquire) (S (S (S (the the) (S (90.6 90.6) (% %))) 
     (S (S (of of) (S (S (Skipper Skipper) ('s 's)) (Inc. Inc.))) (S (it it) 
     (S (does does) (S (n't n't) (own own)))))) (S (for for) (S (S (11.50 11.50) 
     (S (a a) (share share))) (S (or or) (S (about about) (S (28.1 28.1) (million million)))))))))))))))))))'''

result = re.sub(r"\(([\.\%\'\w\d]+) ([\.\%\'\w\d]+)\)", sub, s)

Alexander Riedel · Accepted Answer · 2020-11-30 08:56:13Z

This solution iterates through the list of words, finds adjacent duplicates, and replaces the first occurrence of each duplicate with "(S":

s = """(S (S (S (S (Skipper Skipper) ('s 's)) (Inc. Inc.)) (S (Bellevue Bellevue) (Wash. Wash.))) 
     (S (said said) (S (it it) (S (signed signed) (S (a a) (S (definitive definitive) 
     (S (merger merger) (S (agreement agreement) (S (for for) (S (S (a a) 
     (S (National National) (S (Pizza Pizza) (S (Corp. Corp.) (unit unit))))) 
     (S (to to) (S (acquire acquire) (S (S (S (the the) (S (90.6 90.6) (% %))) 
     (S (S (of of) (S (S (Skipper Skipper) ('s 's)) (Inc. Inc.))) (S (it it) 
     (S (does does) (S (n't n't) (own own)))))) (S (for for) (S (S (11.50 11.50) 
     (S (a a) (share share))) (S (or or) (S (about about) (S (28.1 28.1) (million million)))))))))))))))))))"""

word_list = s.split()

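# enumerate gives the position of the current word directly; list.index
# would return the first equal element, which may be a different position.
for i, (word, next_word) in enumerate(zip(word_list, word_list[1:])):
    if word.replace('(', '').replace(')', '') == next_word.replace('(', '').replace(')', ''):
        word_list[i] = "(S"

# Joining on a single space collapses the original line breaks and indentation.
s_new = " ".join(word_list)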
