python replace string with regexp

Question

I'm looking for a regexp that identifies this for block in a template, so that I can replace the whole block with my own text:

<div>
 {% for link in links %}
     textext
 {% endfor %}
</div>

and get something like this

<div>
 mytext
</div>

I tried this: re.sub('{% for link in links %}.*{% endfor %}', 'mytest', stringHTML)
– Zed, Commented Feb 11, 2013 at 18:14
@Zed What you tried was nearly right. It only needed a ? after .* to make the regex non-greedy, and the re.DOTALL flag so that the dot can match newlines (\n). Now have a look at my solution, in which I use [^\r\n] to get a symbol that still doesn't match line endings even under DOTALL, which generalizes the regex to various forms of {%....%} blocks. Finally, you should consider whether Logan's answer really deserves to be accepted. I dare say I don't think so.
– eyquem, Commented Feb 15, 2013 at 14:50
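
For reference, here is a minimal sketch of the corrected attempt described in that comment: the same pattern with a non-greedy .*? and the re.DOTALL flag. The variable name string_html and the replacement text are only illustrative, and the code is written so that it also runs on Python 3.

```python
import re

# Sample template text from the question.
string_html = """<div>
 {% for link in links %}
     textext
 {% endfor %}
</div>"""

# .*? is non-greedy; re.DOTALL lets the dot cross newlines.
pattern = r"\{% for link in links %\}.*?\{% endfor %\}"
result = re.sub(pattern, "mytext", string_html, flags=re.DOTALL)
print(result)
```

Without re.DOTALL the dot stops at the first newline, so the pattern never reaches {% endfor %} and nothing is replaced.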

Logan · Accepted Answer · 2013-02-15 20:35:19Z

1

Try:

re.sub('\{.*[\w\s]*.*\}','mytext',txt)

Output:

'<div>\n mytext\n</div>'

\{ matches the first brace, then .*[\w\s]*.* matches all of the rest (including whitespace and newlines) until the last brace \}.

You can be more specific with something like:

re.sub('\{% for link in links.*[\w\s]*.*endfor %\}','mytext',txt)

and then you can be sure it will only match a for loop of the type you specified.

EDIT: eyquem pointed out that my answer fails in a number of cases, specifically when the block has symbols in the middle. At the risk of naively misunderstanding why my solution did not work, I simply added an extra bit to my pattern so that it also matches his test cases. We'll see if it works:

re.sub('\{.*[\W\w\s]*.*\}', 'mytext', txt)

RESULT (where txt is eyquem's Pink Floyd example):

"Pink Floyd"
<div>
 mytext
</div>
"Fleetwood Mac"

So, I think the addition of all non-alphanumeric symbols fixes it. Or I may have broken it even more obviously for another case. I'm sure someone will point it out. :)

EDIT2: It should also be noted that both of our solutions fail in the case that there is more than one for-loop on the page. Example:

"Beatles"
<div>
 {% for link in links %}
    iiiY=uuu
    12345678
 {% endfor %}
</div>
"Tino Rossi"
{ for link in links % }
   asdfasdfas
{% endfor% }

yields

"Beatles"
<div>
 mytext

It cuts out the rest by matching on through the next set of braces after the </div>.

EDIT 3: eyquem is right again: his fix stops the match from cutting out the </div> and everything after it when another block follows. His fix fixes mine as well:

re.sub('\{.*[\W\w\s]*?.*\}', 'mytext', txt)

is the new pattern.
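
A small sketch comparing the greedy and non-greedy versions of this pattern on a page with two loops. The sample text is an assumption modelled on the example above, with the second loop written using well-formed tags; the code runs on Python 3 as well.

```python
import re

# Two for-blocks on one page (illustrative sample).
txt = '''"Beatles"
<div>
 {% for link in links %}
    iiiY=uuu
    12345678
 {% endfor %}
</div>
"Tino Rossi"
{% for link in links %}
   asdfasdfas
{% endfor %}'''

# Greedy middle part: runs to the last } in the whole text.
greedy = re.sub(r'\{.*[\W\w\s]*.*\}', 'mytext', txt)
# Non-greedy middle part: stops at the first line that can close the match.
lazy = re.sub(r'\{.*[\W\w\s]*?.*\}', 'mytext', txt)

print(greedy)
print('---------------------------')
print(lazy)
```

The greedy version swallows everything from the first { to the last }, while the non-greedy one replaces each block separately.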

edited Feb 15, 2013 at 20:35

answered Feb 11, 2013 at 18:24

Logan


2 Comments

eyquem Over a year ago

@Logan @Zed I updated my solution (a minimal change: adding a ? after '.+' to make the regex non-greedy) and now it works perfectly well, without the flaw of Logan's one, sorry.

eyquem Over a year ago

@Logan No, it still doesn't work when there is a } inside the block between {% for link in links %} and {% endfor %}. You should run it with groups in the pattern to see how it behaves: '(\{.*)([\W\w\s]*?)(.*\})'
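
The suggestion in this comment can be tried with a short sketch like the following, which runs the grouped pattern on the Pink Floyd sample (the one with a } inside the block) and prints what each group captured. It is written to run on Python 3 as well.

```python
import re

ss1 = '''"Pink Floyd"
<div>
 {% for link in links %}
    aaaY}eee
    12345678
 {% endfor %}
</div>
"Fleetwood Mac"'''

# Same pattern as in the comment, split into groups for inspection.
pat = r'(\{.*)([\W\w\s]*?)(.*\})'

first = re.search(pat, ss1)
print(first.groups())   # what each of the three groups captured
result = re.sub(pat, 'mytext', ss1)
print(result)
```

The first match ends at the stray } on the aaaY}eee line, so part of the block body survives the substitution.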

eyquem · Accepted Answer · 2013-02-15 21:25:26Z

1

I regret to say that Logan's answer doesn't work in the following cases:

import re

ss1 = '''"Pink Floyd"
<div>
 {% for link in links %}
    aaaY}eee
    12345678
 {% endfor %}
</div>
"Fleetwood Mac"'''

pat = '(\{.*)([\w\s]*)(.*)(\})'
print ss1
print '---------------------------'
for el in re.findall(pat,ss1):
    print el
print '---------------------------'
print re.sub(pat,':::::',ss1)

RESULT

"Pink Floyd"
<div>
 {% for link in links %}
    aaaY}eee  # <--------- } here
    12345678
 {% endfor %}
</div>
"Fleetwood Mac"
---------------------------
('{% for link in links %}', '\n    aaaY', '', '}')
('{% endfor %', '', '', '}')
---------------------------
"Pink Floyd"
<div>
 :::::eee
    12345678
 :::::
</div>
"Fleetwood Mac"


import re

ss2 = '''"Beatles"
<div>
 {% for link in links %}
    iiiY=uuu
    12345678
 {% endfor %}
</div>
"Tino Rossi"'''

pat = '(\{.*)([\w\s]*)(.*)(\})'
print ss2
print '---------------------------'
for el in re.findall(pat,ss2):
    print el
print '---------------------------'
print re.sub(pat,':::::',ss2)

RESULT

"Beatles"
<div>
 {% for link in links %}
    iiiY=uuu  # <-------- = here
    12345678
 {% endfor %}
</div>
"Tino Rossi"
---------------------------
('{% for link in links %', '', '', '}')
('{% endfor %', '', '', '}')
---------------------------
"Beatles"
<div>
 :::::
    iiiY=uuu
    12345678
 :::::
</div>
"Tino Rossi"

The problem is the following (the findall() results in my code help to understand it):

The first .* runs until it encounters a newline.
Then [\w\s]* runs as long as it finds characters in these categories: letters, digits, underscore, whitespace.
Newlines count as whitespace, so [\w\s]* can run on from one line to the next.
But if [\w\s]* encounters a character outside these categories, it stops at that character.

If that character is a }, the last .* matches '' before this }, and the match ends there.
Then the regex searches for the next match.

If it is a =, the last .* can't reach the next } because it can't cross the next newline, so the regex has to backtrack and ends up matching only the tag itself. Hence the different result than with a } in the text.
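
The behaviour of the individual pieces described above can be checked in isolation with a few one-liners. This is only an illustrative sketch; the sample strings are made up for the purpose, and it runs on Python 3 as well.

```python
import re

two_lines = "line one\nline two"

# Without re.DOTALL, .* stops at the first newline.
dot_match = re.match(r".*", two_lines).group()

# [\w\s]* crosses newlines, because \s includes them.
class_match = re.match(r"[\w\s]*", two_lines).group()

# [\w\s]* stops at the first character outside its categories.
stops_at_brace = re.match(r"[\w\s]*", "aaaY}eee\n12345678").group()
stops_at_equals = re.match(r"[\w\s]*", "iiiY=uuu\n12345678").group()

print(repr(dot_match))        # 'line one'
print(repr(class_match))      # 'line one\nline two'
print(repr(stops_at_brace))   # 'aaaY'
print(repr(stops_at_equals))  # 'iiiY'
```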

Replacing .* with .+ doesn't change anything, as can be seen by making that replacement in the code above.

MY SOLUTION

I propose the pattern in this code:

import re
pat = ('\{%[^\r\n]+%\}'
       '.+?'
       '\{%[^\r\n]+%\}')


ss = '''"Pink Floyd"
<div>
 {% for link in links %}
    aaaY}eee
    12345678
 {% endfor %}
</div>
"Fleetwood Mac"
"Beth Hart"
"Jimmy Cliff"
"Led Zepelin"
Beatles"
<div>
 {% for link in links %}
    iiiY=uuu
    12345678
 {% endfor %}
</div>
"Tino Rossi"'''


print '\n',ss,'\n\n---------------------------\n'
print re.sub(pat,':::::',ss,flags=re.DOTALL)

resulting in

"Pink Floyd"
<div>
 {% for link in links %}
    aaaY}eee
    12345678
 {% endfor %}
</div>
"Fleetwood Mac"
"Beth Hart"
"Jimmy Cliff"
"Led Zepelin"
Beatles"
<div>
 {% for link in links %}
    iiiY=uuu
    12345678
 {% endfor %}
</div>
"Tino Rossi" 

---------------------------

"Pink Floyd"
<div>
 :::::
</div>
"Fleetwood Mac"
"Beth Hart"
"Jimmy Cliff"
"Led Zepelin"
Beatles"
<div>
 :::::
</div>
"Tino Rossi"

EDIT

Simpler:

pat = ('\{%[^}]+%\}'
       '.+?'
       '\{%[^}]+%\}')

but only if the {%.....%} lines don't contain a } sign.
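
As a sketch of how the first pattern above might be packaged for reuse on Python 3, here it is compiled once from raw strings and wrapped in a small function. The names BLOCK and replace_blocks are hypothetical, and the sample text is made up; it assumes each {% ... %} tag sits on a single line.

```python
import re

# The pattern from this answer, compiled once with re.DOTALL.
BLOCK = re.compile(
    r'\{%[^\r\n]+%\}'   # opening tag, kept within a single line
    r'.+?'              # block body, as short as possible
    r'\{%[^\r\n]+%\}',  # closing tag, kept within a single line
    re.DOTALL,
)


def replace_blocks(text, replacement):
    """Replace every {% ... %} ... {% ... %} block in text (hypothetical helper)."""
    return BLOCK.sub(replacement, text)


sample = ("<div>\n {% for link in links %}\n    aaaY}eee\n {% endfor %}\n</div>\n"
          "<p>\n {% for x in xs %}\n    b=c\n {% endfor %}\n</p>")
cleaned = replace_blocks(sample, "mytext")
print(cleaned)
```

Note that the pattern simply pairs each tag with the next tag it finds, so nested blocks or other tags inside the loop body would be paired incorrectly.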

edited Feb 15, 2013 at 21:25

answered Feb 11, 2013 at 22:22

eyquem


6 Comments

Logan Over a year ago

Well, that's unfortunate. :( Good answer, though! I didn't want to make it too complicated (lazy :)) but I guess it didn't cut it this time.

eyquem Over a year ago

@Logan May I leave a comment for the OP to point him to my answer? The risk for you is that he changes which answer he accepts.

Logan Over a year ago

Well, I was going to do it myself, except that I think I fixed my answer. I am going to update mine and let's see what you think.

eyquem Over a year ago

@Logan I updated mine too: I replaced '.+' with '.+?' and it works perfectly well

Logan Over a year ago

Honestly, I think our solutions are too similar. It comes down to preference in which expressions to use. Your fix fixes mine as well. You match the whole start of the for loop and the whole end tag whereas I simply match the first and last brace of each, respectively.


sotapme · Accepted Answer · 2013-02-11 18:21:58Z

0

The sledgehammer approach would be:

In [540]: txt = """<div>
 {% for link in links %}
     textext
 {% endfor %}
</div>"""

In [541]: txt
Out[541]: '<div>\n {% for link in links %}\n     textext\n {% endfor %}\n</div>'

In [542]: re.sub("(?s)<div>.*?</div>", "<div>mytext</div>", txt)
Out[542]: '<div>mytext</div>'

answered Feb 11, 2013 at 18:21

sotapme


5 Comments

Zed Over a year ago

Thanks, but my mistake was not saying that sometimes there will be a div around the for block and sometimes not; I can't know in advance.

sotapme Over a year ago

That's why using regexps on this sort of thing is usually a bad idea. Using lxml or one of the other parsers would be better, as you can narrow down which div you're talking about using XPath. I told you it was a sledgehammer; you need a scalpel. XML/HTML are structured documents for a reason.

eyquem Over a year ago

@sotapme I didn't know the (?s) notation in a regex pattern. I tried some code and understood that it is equivalent to the re.DOTALL flag. But I've never seen any information about (?s) in the Python documentation. It isn't specific to Python 3, because it works in my Python 2 code. Where can I find documentation on this, please?

sotapme Over a year ago

Search for (?iLmsux) regular-expression-syntax

eyquem Over a year ago

@sotapme ...aaaaaaaAhh ! ...oooOOOK !! Thank you. I hadn't understood it when I read it in my early days of learning regexes, and I never stopped at that point of the docs again.
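
To illustrate the equivalence discussed in these comments, here is a short sketch comparing the inline (?s) flag with flags=re.DOTALL. The pattern and variable names are only illustrative, and the code runs on Python 3 as well.

```python
import re

txt = "<div>\n {% for link in links %}\n     textext\n {% endfor %}\n</div>"
pattern = r"\{% for link in links %\}.*?\{% endfor %\}"

# Flag passed as an argument.
with_flag = re.sub(pattern, "mytext", txt, flags=re.DOTALL)
# Same flag written inline at the start of the pattern.
with_inline = re.sub("(?s)" + pattern, "mytext", txt)
# No flag at all: the dot cannot cross newlines, so nothing matches.
without = re.sub(pattern, "mytext", txt)

print(with_flag == with_inline)
print(without == txt)
```

Both flagged versions give the same result, while the unflagged call leaves the text unchanged.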
