I'm looking for a regexp that identifies this for block in a template, so that I can replace the whole block with my own text:

<div>
 {% for link in links %}
     textext
 {% endfor %}
</div>

and get something like this

<div>
 mytext
</div>
  • I tried this: re.sub('{% for link in links %}.*{% endfor %}', 'mytest', stringHTML) Commented Feb 11, 2013 at 18:14
  • @Zed What you tried was nearly right. It only needed a ? after .* to make the regex non-greedy, and the flag re.DOTALL so that the dot can also match newlines (\n). Also have a look at my solution, in which I used [^\r\n] to get a symbol that still doesn't match line endings even in a DOTALL context, which generalizes the regex to various forms of {%....%} blocks. Finally, you should consider whether Logan's answer really deserves to be accepted; I dare say I don't think so. Commented Feb 15, 2013 at 14:50
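
The fix described in the comment above can be sketched as follows. This is a minimal Python 3 illustration, not code posted by anyone in the thread: the names txt, pattern and fixed are made up for the example, and the braces are escaped only as a precaution.

```python
import re

# Sample template from the question.
txt = """<div>
 {% for link in links %}
     textext
 {% endfor %}
</div>"""

# .*? is non-greedy, so it stops at the first {% endfor %};
# re.DOTALL lets the dot match the newlines inside the block.
pattern = r"\{% for link in links %\}.*?\{% endfor %\}"
fixed = re.sub(pattern, "mytext", txt, flags=re.DOTALL)
print(fixed)
```

Only the block itself is replaced, so the surrounding markup and indentation are left alone.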

3 Answers

Try:

re.sub(r'\{.*[\w\s]*.*\}','mytext',txt)

Output:

'<div>\n mytext\n</div>'

\{ matches the first brace, then .*[\w\s]*.* matches all of the rest (including whitespace and newlines) until the last brace \}.

You can be more specific with something like:

re.sub(r'\{% for link in links.*[\w\s]*.*endfor %\}','mytext',txt)

and then you can be sure it will only match a for loop of the type you specified.

EDIT: eyquem pointed out that my answer was insufficient for a number of cases, specifically if it has symbols in the middle. At the risk of naively misunderstanding why my solution did not work, I simply added an extra bit to my pattern that successfully matches even his test cases, so we'll see if it works:

re.sub(r'\{.*[\W\w\s]*.*\}', 'mytext', txt)

RESULT (where txt is eyquem's Pink Floyd example):

"Pink Floyd"
<div>
 mytext
</div>
"Fleetwood Mac"

So, I think the addition of all non-alphanumeric symbols fixes it. Or I may have broken it even more obviously for another case. I'm sure someone will point it out. :)

EDIT2: It should also be noted that both of our solutions fail in the case that there is more than one for-loop on the page. Example:

"Beatles"
<div>
 {% for link in links %}
    iiiY=uuu
    12345678
 {% endfor %}
</div>
"Tino Rossi"
{ for link in links % }
   asdfasdfas
{% endfor% }

yields

"Beatles"
<div>
 mytext

It cuts out the rest, because the greedy match runs on through the next block, past the </div>.

EDIT 3: eyquem is right again: his fix keeps the match from cutting out whatever follows the block. His fix fixes mine as well:

re.sub(r'\{.*[\W\w\s]*?.*\}', 'mytext', txt)

is the new pattern.

2 Comments

@Logan @Zed I updated my solution (a minimal change: adding a ? after '.+' to make the regex non-greedy) and now it works perfectly well, without the flaw of Logan's, sorry.
@Logan No, it still doesn't work when there is a } in the block between {% for link in links %} and {% endfor %}. You should run it with groups in the pattern to see how it behaves: '(\{.*)([\W\w\s]*?)(.*\})'
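
The experiment suggested in the last comment could look like the sketch below (Python 3). The test string is the Pink Floyd example from eyquem's answer; the names grouped and result are chosen for this illustration only.

```python
import re

# Test string from eyquem's answer: note the } inside the block.
ss1 = '''"Pink Floyd"
<div>
 {% for link in links %}
    aaaY}eee
    12345678
 {% endfor %}
</div>
"Fleetwood Mac"'''

# Logan's latest pattern, split into groups as the comment suggests.
grouped = r'(\{.*)([\W\w\s]*?)(.*\})'

# Show what each part of the pattern captured for every match.
for m in re.finditer(grouped, ss1):
    print(m.groups())

result = re.sub(grouped, 'mytext', ss1)
print(result)
```

The first match stops at the } inside the block, so part of the block body survives the substitution and the closing tag is then replaced separately.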

I regret to say that Logan's answer doesn't work in the following cases:

import re

ss1 = '''"Pink Floyd"
<div>
 {% for link in links %}
    aaaY}eee
    12345678
 {% endfor %}
</div>
"Fleetwood Mac"'''

# Logan's first pattern, split into groups to show what each part matches
pat = r'(\{.*)([\w\s]*)(.*)(\})'
print(ss1)
print('---------------------------')
for el in re.findall(pat, ss1):
    print(el)
print('---------------------------')
print(re.sub(pat, ':::::', ss1))

RESULT

"Pink Floyd"
<div>
 {% for link in links %}
    aaaY}eee  # <--------- } here
    12345678
 {% endfor %}
</div>
"Fleetwood Mac"
---------------------------
('{% for link in links %}', '\n    aaaY', '', '}')
('{% endfor %', '', '', '}')
---------------------------
"Pink Floyd"
<div>
 :::::eee
    12345678
 :::::
</div>
"Fleetwood Mac"

import re

ss2 = '''"Beatles"
<div>
 {% for link in links %}
    iiiY=uuu
    12345678
 {% endfor %}
</div>
"Tino Rossi"'''

# Same grouped pattern as in the first test
pat = r'(\{.*)([\w\s]*)(.*)(\})'
print(ss2)
print('---------------------------')
for el in re.findall(pat, ss2):
    print(el)
print('---------------------------')
print(re.sub(pat, ':::::', ss2))

RESULT

"Beatles"
<div>
 {% for link in links %}
    iiiY=uuu  # <--------- = here
    12345678
 {% endfor %}
</div>
"Tino Rossi"
---------------------------
('{% for link in links %', '', '', '}')
('{% endfor %', '', '', '}')
---------------------------
"Beatles"
<div>
 :::::
    iiiY=uuu
    12345678
 :::::
</div>
"Tino Rossi"

The problem is the following (the findall() results printed by my code help to understand it):

The first .* runs until it encounters a newline.
Then [\w\s]* runs as long as it finds characters of these categories: letters, digits, underscore, whitespace.
Newlines count as whitespace, so [\w\s]* can run on from one line to the next.
But if [\w\s]* encounters a character outside these categories, it stops at that character.

If that character is a }, the last .* matches '' before this }.
Then the regex searches for the next match.

If it is a =, the last .* can't match the rest of the text up to the next }, because it can't get past the next newline. Hence the result differs from the one obtained with a } in the text.

Replacing .* with .+ doesn't change anything, as can be seen by replacing .* with .+ in the code above.
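
That claim can be checked with a short sketch (Python 3). The two test strings are the ones used above; the names pat_star and pat_plus are made up for this comparison.

```python
import re

ss1 = '''"Pink Floyd"
<div>
 {% for link in links %}
    aaaY}eee
    12345678
 {% endfor %}
</div>
"Fleetwood Mac"'''

ss2 = '''"Beatles"
<div>
 {% for link in links %}
    iiiY=uuu
    12345678
 {% endfor %}
</div>
"Tino Rossi"'''

pat_star = r'(\{.*)([\w\s]*)(.*)(\})'
pat_plus = r'(\{.+)([\w\s]*)(.+)(\})'

# Compare the substitution results of both variants on each test string.
for text in (ss1, ss2):
    same = re.sub(pat_star, ':::::', text) == re.sub(pat_plus, ':::::', text)
    print(same)
```

The groups capture different pieces in the two variants, but the overall matched spans, and therefore the substituted text, come out the same for these two inputs.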

MY SOLUTION

I propose the pattern in this code:

import re
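pat = (r'\{%[^\r\n]+%\}'    # opening tag, confined to one line
       r'.+?'               # block body, as short as possible
       r'\{%[^\r\n]+%\}')   # closing tag, confined to one line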


ss = '''"Pink Floyd"
<div>
 {% for link in links %}
    aaaY}eee
    12345678
 {% endfor %}
</div>
"Fleetwood Mac"
"Beth Hart"
"Jimmy Cliff"
"Led Zepelin"
Beatles"
<div>
 {% for link in links %}
    iiiY=uuu
    12345678
 {% endfor %}
</div>
"Tino Rossi"'''


print('\n', ss, '\n\n---------------------------\n')
print(re.sub(pat, ':::::', ss, flags=re.DOTALL))

resulting in

"Pink Floyd"
<div>
 {% for link in links %}
    aaaY}eee
    12345678
 {% endfor %}
</div>
"Fleetwood Mac"
"Beth Hart"
"Jimmy Cliff"
"Led Zepelin"
Beatles"
<div>
 {% for link in links %}
    iiiY=uuu
    12345678
 {% endfor %}
</div>
"Tino Rossi" 

---------------------------

"Pink Floyd"
<div>
 :::::
</div>
"Fleetwood Mac"
"Beth Hart"
"Jimmy Cliff"
"Led Zepelin"
Beatles"
<div>
 :::::
</div>
"Tino Rossi"

EDIT

Simpler:

pat = (r'\{%[^}]+%\}'
       r'.+?'
       r'\{%[^}]+%\}')

This works only if the {%.....%} lines don't contain the character }.
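
To compare the two patterns of this answer side by side, here is a hedged Python 3 sketch. The helper replace_blocks, the constants LINE_TAGS and NO_BRACE_TAGS, and the second sample (an opening tag that itself contains a }) are all invented for this illustration.

```python
import re

# Line-based variant: [^\r\n] keeps each {% ... %} tag on one line, even under DOTALL.
LINE_TAGS = re.compile(r'\{%[^\r\n]+%\}.+?\{%[^\r\n]+%\}', flags=re.DOTALL)
# Simpler variant: [^}] forbids any } inside a tag.
NO_BRACE_TAGS = re.compile(r'\{%[^}]+%\}.+?\{%[^}]+%\}', flags=re.DOTALL)


def replace_blocks(pattern, text, replacement):
    """Replace every block matched by `pattern` with the literal `replacement`."""
    # A function replacement keeps backslashes in `replacement` literal.
    return pattern.sub(lambda m: replacement, text)


# A } in the block body is fine for both variants.
plain = """<div>
 {% for link in links %}
    aaaY}eee
 {% endfor %}
</div>"""

# Contrived opening tag that itself contains a }.
tricky = """<div>
 {% for k in d["}"] %}
    body
 {% endfor %}
</div>"""

for name, pattern in (("line-based", LINE_TAGS), ("no-brace", NO_BRACE_TAGS)):
    print(name, repr(replace_blocks(pattern, plain, "mytext")))
    print(name, repr(replace_blocks(pattern, tricky, "mytext")))
```

Both variants simply pair one tag with the next one, so neither understands nested blocks; for templates with nesting, a real template parser is probably the safer choice.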

6 Comments

Well, that's unfortunate. :( Good answer, though! I didn't want to make it too complicated (lazy :)) but I guess it didn't cut it this time.
@Logan May I post a comment for the OP to point him to my answer? The risk for you is that he changes which answer he accepts.
Well, I was going to do it myself, except that I think I fixed my answer. I am going to update mine and let's see what you think.
@Logan I updated mine too: I replaced '.+' with '.+?' and it works perfectly well
Honestly, I think our solutions are too similar. It comes down to preference in which expressions to use. Your fix fixes mine as well. You match the whole start of the for loop and the whole end tag whereas I simply match the first and last brace of each, respectively.

The sledgehammer approach would be:

In [540]: txt = """<div>
 {% for link in links %}
     textext
 {% endfor %}
</div>"""

In [541]: txt
Out[541]: '<div>\n {% for link in links %}\n     textext\n {% endfor %}\n</div>'

In [542]: re.sub("(?s)<div>.*?</div>", "<div>mytext</div>", txt)
Out[542]: '<div>mytext</div>'

5 Comments

Thanks, but my mistake was that I didn't say that sometimes there will be a div around the for block and sometimes not; I can't know in advance.
That's why using regexps on this sort of thing is usually a bad idea. Using lxml or one of the other parsers would be better, as you can narrow down which div you're talking about using XPath. I told you it was a sledgehammer; you need a scalpel. XML/HTML are structured documents for a reason.
@sotapme I didn't know the (?s) notation in a regex pattern. I tried some code and understood that it is equivalent to the flag re.DOTALL. But I've never seen any information about (?s) in the Python documentation. It isn't specific to Python 3, because it works in my Python 2 code. Where can I find documentation on this, please?
Search for (?iLmsux) regular-expression-syntax
@sotapme ...aaaaaaaAhh ! ...oooOOOK !! Thank you. I hadn't understood it when I read it back when I was first learning regexes, and I never went back to that part of the docs.
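
The equivalence discussed in these comments can be checked directly. A small Python 3 sketch, with the variable names inline and flagged chosen just for the example:

```python
import re

txt = """<div>
 {% for link in links %}
     textext
 {% endfor %}
</div>"""

# Inline flag (?s) at the start of the pattern ...
inline = re.sub(r"(?s)<div>.*?</div>", "<div>mytext</div>", txt)
# ... versus the same pattern with the re.DOTALL flag passed explicitly.
flagged = re.sub(r"<div>.*?</div>", "<div>mytext</div>", txt, flags=re.DOTALL)

print(inline == flagged)
```

The inline form is handy when the pattern is stored as a plain string and the call site offers no way to pass flags.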
