I'm looking for a regexp to identify this for block in a template, so that I can provide text to replace the whole block:
<div>
{% for link in links %}
textext
{% endfor %}
</div>
and get something like this
<div>
mytext
</div>
Try:
re.sub('\{.*[\w\s]*.*\}','mytext',txt)
Output:
'<div>\n mytext\n</div>'
\{ matches the first brace, then .*[\w\s]*.* matches all of the rest (including whitespace and newlines) until the last brace \}.
You can be more specific with something like:
re.sub('\{% for link in links.*[\w\s]*.*endfor %\}','mytext',txt)
and then you can be sure it will only match a for loop of the type you specified.
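To check both patterns against the question's template, here is a minimal sketch (written for Python 3, with raw strings for the patterns). The one-space indentation inside txt is an assumption taken from the repr shown in the output above, and the specific pattern spells the closing tag endfor, as in the template.

```python
import re

# Template from the question. The one-space indentation is an assumption
# based on the repr shown in the output above.
txt = "<div>\n {% for link in links %}\n textext\n {% endfor %}\n</div>"

# Generic pattern: from the first { to the last } it can reach
generic = r'\{.*[\w\s]*.*\}'
# More specific pattern: anchored on this particular for loop
specific = r'\{% for link in links.*[\w\s]*.*endfor %\}'

generic_result = re.sub(generic, 'mytext', txt)
specific_result = re.sub(specific, 'mytext', txt)

print(repr(generic_result))
print(repr(specific_result))
```

On this input both calls should give the same string, since the block body contains only word characters and whitespace.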
EDIT: eyquem pointed out that my answer was insufficient for a number of cases, specifically if it has symbols in the middle. At the risk of naively misunderstanding why my solution did not work, I simply added an extra bit to my pattern that successfully matches even his test cases, so we'll see if it works:
re.sub('\{.*[\W\w\s]*.*\}', 'mytext', txt)
RESULT (where txt is eyquem's Pink Floyd example):
"Pink Floyd"
<div>
mytext
</div>
"Fleetwood Mac"
So, I think the addition of all non-alphanumeric symbols fixes it. Or I may have broken it even more obviously for another case. I'm sure someone will point it out. :)
EDIT2: It should also be noted that both of our solutions fail in the case that there is more than one for-loop on the page. Example:
"Beatles"
<div>
{% for link in links %}
iiiY=uuu
12345678
{% endfor %}
</div>
"Tino Rossi"
{ for link in links % }
asdfasdfas
{% endfor% }
yields
"Beatles"
<div>
mytext
And cuts out the rest, because the greedy match runs on through the next set AFTER the </div>.
EDIT 2: eyquem is right again in fixing his so that it does not cut out the following block if there is one after. His fix fixes mine as well:
re.sub('\{.*[\W\w\s]*?.*\}', 'mytext', txt)
is the new pattern.
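A small sketch (Python 3) comparing the greedy and the ungreedy pattern on a page with two blocks. The second block is written here with well-formed tags, which is an assumption made to keep the demo simple. The last part shows a limitation that seems to remain: a stray } inside the block body still ends the match too early.

```python
import re

greedy = r'\{.*[\W\w\s]*.*\}'
lazy = r'\{.*[\W\w\s]*?.*\}'

# Two blocks on one page (both written with well-formed tags here,
# which is an assumption made to keep the demo simple)
two_blocks = ('"Beatles"\n<div>\n{% for link in links %}\niiiY=uuu\n'
              '12345678\n{% endfor %}\n</div>\n"Tino Rossi"\n'
              '{% for link in links %}\nasdfasdfas\n{% endfor %}')

greedy_result = re.sub(greedy, 'mytext', two_blocks)
lazy_result = re.sub(lazy, 'mytext', two_blocks)
print(greedy_result)   # everything after the first { is swallowed
print('---')
print(lazy_result)     # each block is replaced separately

# A stray } inside the block body still trips the lazy version
stray_brace = ('<div>\n{% for link in links %}\naaaY}eee\n'
               '12345678\n{% endfor %}\n</div>')
stray_result = re.sub(lazy, 'mytext', stray_brace)
print('---')
print(stray_result)
```

So the ? helps with several blocks on a page, but the pattern still relies on the body containing no } of its own.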
[...] ? sign after '.+' to make the regex ungreedy) and now it works perfectly well without having the flaw of Logan's one, sorry.
[...] } placed in the block between {% for link in links %} and {% endfor %}. You should execute with groups in the pattern, to see how it functions: '(\{.*)([\W\w\s]*?)(.*\})'
I regret to say that Logan's answer doesn't work in the following cases:
import re

ss1 = '''"Pink Floyd"
<div>
{% for link in links %}
aaaY}eee
12345678
{% endfor %}
</div>
"Fleetwood Mac"'''

# Logan's first pattern, split into groups to show what each part matches
pat = r'(\{.*)([\w\s]*)(.*)(\})'

print(ss1)
print('---------------------------')
for el in re.findall(pat, ss1):
    print(el)
print('---------------------------')
print(re.sub(pat, ':::::', ss1))
RESULT
"Pink Floyd"
<div>
{% for link in links %}
aaaY}eee # <--------- } here
12345678
{% endfor %}
</div>
"Fleetwood Mac"
---------------------------
('{% for link in links %}', '\n aaaY', '', '}')
('{% endfor %', '', '', '}')
---------------------------
"Pink Floyd"
<div>
:::::eee
12345678
:::::
</div>
"Fleetwood Mac"
import re

# Same test, but the block body now contains a = (line iiiY=uuu) instead of a }
ss2 = '''"Beatles"
<div>
{% for link in links %}
iiiY=uuu
12345678
{% endfor %}
</div>
"Tino Rossi"'''

pat = r'(\{.*)([\w\s]*)(.*)(\})'

print(ss2)
print('---------------------------')
for el in re.findall(pat, ss2):
    print(el)
print('---------------------------')
print(re.sub(pat, ':::::', ss2))
RESULT
"Beatles"
<div>
{% for link in links %}
iiiY=uuu
12345678
{% endfor %}
</div>
"Tino Rossi"
---------------------------
('{% for link in links %', '', '', '}')
('{% endfor %', '', '', '}')
---------------------------
"Beatles"
<div>
:::::
iiiY=uuu
12345678
:::::
</div>
"Tino Rossi"
The problem is the following (the results of findall() in my code help to understand it):
The first .* runs as long as it doesn't encounter a newline.
Then [\w\s]* runs as long as there are characters of these categories: letters, digits, underscore, whitespace.
Newlines are whitespace, so [\w\s]* can run on from one line to the next.
But if [\w\s]* encounters a character that is not in these categories, it stops at this character.
If it is a }, the last .* matches '' before this }.
Then the regex searches for the next match.
If it is a =, the last .* can't match the rest of the text up to the next }, because it can't pass the next newline. The regex then backtracks until the first .* gives up the } that closes the opening tag, so only that tag is matched. Hence the different result than with a } in the text.
Replacing .* with .+ doesn't change anything, as can be seen by making that replacement in the code above.
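A quick way to check this claim is to run both variants on the two test strings and compare the substituted text. This is only a sketch (Python 3); the names star and plus are mine. The captured groups may differ slightly between the two variants, but the substitution should come out the same.

```python
import re

ss1 = ('"Pink Floyd"\n<div>\n{% for link in links %}\naaaY}eee\n'
       '12345678\n{% endfor %}\n</div>\n"Fleetwood Mac"')
ss2 = ('"Beatles"\n<div>\n{% for link in links %}\niiiY=uuu\n'
       '12345678\n{% endfor %}\n</div>\n"Tino Rossi"')

star = r'(\{.*)([\w\s]*)(.*)(\})'   # pattern used in the tests above
plus = r'(\{.+)([\w\s]*)(.+)(\})'   # same pattern with .+ instead of .*

results = {}
for name, text in (('ss1', ss1), ('ss2', ss2)):
    with_star = re.sub(star, ':::::', text)
    with_plus = re.sub(plus, ':::::', text)
    results[name] = (with_star, with_plus)
    print(name, 'same substitution:', with_star == with_plus)
```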
MY SOLUTION
I propose the pattern in this code:
import re

# Each {%...%} tag must stay on one line ([^\r\n]); the lazy .+? then
# spans the block body, newlines included, thanks to re.DOTALL
pat = (r'\{%[^\r\n]+%\}'
       r'.+?'
       r'\{%[^\r\n]+%\}')

ss = '''"Pink Floyd"
<div>
{% for link in links %}
aaaY}eee
12345678
{% endfor %}
</div>
"Fleetwood Mac"
"Beth Hart"
"Jimmy Cliff"
"Led Zepelin"
Beatles"
<div>
{% for link in links %}
iiiY=uuu
12345678
{% endfor %}
</div>
"Tino Rossi"'''

print('\n' + ss + '\n\n---------------------------\n')
print(re.sub(pat, ':::::', ss, flags=re.DOTALL))
resulting in
"Pink Floyd"
<div>
{% for link in links %}
aaaY}eee
12345678
{% endfor %}
</div>
"Fleetwood Mac"
"Beth Hart"
"Jimmy Cliff"
"Led Zepelin"
Beatles"
<div>
{% for link in links %}
iiiY=uuu
12345678
{% endfor %}
</div>
"Tino Rossi"
---------------------------
"Pink Floyd"
<div>
:::::
</div>
"Fleetwood Mac"
"Beth Hart"
"Jimmy Cliff"
"Led Zepelin"
Beatles"
<div>
:::::
</div>
"Tino Rossi"
EDIT
Simpler:
pat = ('\{%[^}]+%\}'
'.+?'
'\{%[^}]+%\}')
only if the lines {%.....%} don't contain the character }
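To illustrate that restriction, here is a sketch (Python 3) comparing the [^}] variant with the [^\r\n] variant. The tag containing a dict literal is a made-up example, used only to get a } inside a {%...%} line. Both variants need flags=re.DOTALL so that .+? can cross newlines.

```python
import re

simple = r'\{%[^}]+%\}.+?\{%[^}]+%\}'
general = r'\{%[^\r\n]+%\}.+?\{%[^\r\n]+%\}'

# Ordinary block: a } in the body is fine for both variants
plain = '<div>\n{% for link in links %}\naaaY}eee\n{% endfor %}\n</div>'
plain_simple = re.sub(simple, ':::::', plain, flags=re.DOTALL)
plain_general = re.sub(general, ':::::', plain, flags=re.DOTALL)

# Hypothetical tag with a } inside the {%...%} line itself
tricky = "<div>\n{% for x in {'a': 1} %}\nbody\n{% endfor %}\n</div>"
tricky_simple = re.sub(simple, ':::::', tricky, flags=re.DOTALL)
tricky_general = re.sub(general, ':::::', tricky, flags=re.DOTALL)

print(repr(plain_simple))
print(repr(plain_general))
print(repr(tricky_simple))   # left unchanged: the simpler pattern finds no match
print(repr(tricky_general))
```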
[...] '.+' with '.+?' and it works perfectly well
The sledgehammer approach would be:
In [540]: txt = """<div>
{% for link in links %}
textext
{% endfor %}
</div>"""
In [541]: txt
Out[541]: '<div>\n {% for link in links %}\n textext\n {% endfor %}\n</div>'
In [542]: re.sub("(?s)<div>.*?</div>", "<div>mytext</div>", txt)
Out[542]: '<div>mytext</div>'
lxml or one of the other parsers would be a better idea, as you can narrow down which div you're talking about using XPath. Told you it was a sledgehammer; you need a scalpel. XML/HTML are structured documents for a reason.
[...] (?s) in the regex's pattern. I tried some code and I understand that it is equivalent to the flag re.DOTALL. But I've never seen any information about this (?s) in the Python documentation. It doesn't belong to Python 3 only, because it works in my Python 2 code. Where can I find documentation on this, please?
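On the (?s) question: it is an inline flag, and the re module documentation lists these in its section on regular expression syntax. A short sketch (Python 3) to confirm that it behaves like flags=re.DOTALL on the example above; the one-space indentation in txt is taken from the Out[541] repr.

```python
import re

txt = "<div>\n {% for link in links %}\n textext\n {% endfor %}\n</div>"

# Inline flag at the start of the pattern
inline = re.sub(r"(?s)<div>.*?</div>", "<div>mytext</div>", txt)
# Same pattern with the flag passed as an argument
flagged = re.sub(r"<div>.*?</div>", "<div>mytext</div>", txt, flags=re.DOTALL)
# Without either, the dot cannot cross the newlines, so nothing matches
without = re.sub(r"<div>.*?</div>", "<div>mytext</div>", txt)

print(repr(inline))
print(repr(flagged))
print(repr(without))
```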
[...] ? after .* to make the regex ungreedy, and to put the flag re.DOTALL to make the dot able to match the newlines \n. - Now, you should look at my solution, in which I used [^\r\n] to obtain a symbol that still doesn't match the ends of lines even within a DOTALL context, thus generalizing the regex to varied forms of blocks {%....%} - Finally you should consider if the answer of Logan really deserves to be accepted. I dare to say I don't think so.
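The point about [^\r\n] can be seen in isolation with a small sketch (Python 3, variable names are mine): under re.DOTALL the dot runs across lines, while the negated class still stops at each line end.

```python
import re

s = '{% for link in links %}\nbody\n{% endfor %}'

# With DOTALL the greedy dot runs from the first {% to the last %}
with_dot = re.findall(r'\{%.+%\}', s, flags=re.DOTALL)
# [^\r\n] never matches a line end, whatever the flags are
with_class = re.findall(r'\{%[^\r\n]+%\}', s, flags=re.DOTALL)

print(with_dot)
print(with_class)
```

This is why the tag parts of the pattern can stay confined to one line while .+? is free to span the block body.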