python regular expression replace

Question

I'm trying to change a string that contains substrings such as

the</span></p>
<p><span class=font7>currency

to

the currency

The line break is a CRLF.

The words before and after the markup vary. I only want to replace when the second word starts with a lowercase letter. The only part of the markup that changes is the digit after 'font'.

I tried:

p = re.compile('</span></p>\r\n<p><span class=font\d>([a-z])')
res = p.sub(' \1', data)

but this isn't working

How should I fix this?

In addition to @JBernardo: your question is unclear. Such as? So what else?
– FailedDev, Commented Oct 15, 2011 at 23:48
@JBernardo: what would you suggest, especially to replace only if the second word starts with a lower case letter?
– foosion, Commented Oct 15, 2011 at 23:56
How about putting python strip html into the search box and pressing enter? The first hit will get you this nifty answer.
– ekhumoro, Commented Oct 16, 2011 at 0:05

Peter Graham · Accepted Answer · 2011-10-16 00:00:16Z

1

Use a lookahead assertion.

import re
p = re.compile(r'</span></p>\r\n<p><span class=font\d>(?=[a-z])')
res = p.sub(' ', data)
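A minimal sketch of why the original attempt fails and how the lookahead avoids the problem. In a normal (non-raw) Python string, ' \1' is a space followed by the control character chr(1), not a backreference, so the first letter of the second word gets replaced by junk. The sample strings below are assumptions modelled on the question, not the asker's real data.

```python
import re

# Sample input with a CRLF between the paragraphs (assumed from the question).
data = "the</span></p>\r\n<p><span class=font7>currency"

# Raw strings pass the backslashes through to the regex engine unchanged.
# The lookahead (?=[a-z]) requires a lowercase letter without consuming it,
# so the replacement does not need a backreference at all.
lookahead = re.compile(r"</span></p>\r\n<p><span class=font\d>(?=[a-z])")
result_lookahead = lookahead.sub(" ", data)

# Capture-group form of the same idea. The replacement must be a raw string
# so that \1 means "group 1" rather than chr(1).
capture = re.compile(r"</span></p>\r\n<p><span class=font\d>([a-z])")
result_capture = capture.sub(r" \1", data)

# A second word starting with an uppercase letter is left untouched.
unchanged = "The</span></p>\r\n<p><span class=font3>Currency"
result_unchanged = lookahead.sub(" ", unchanged)

print(result_lookahead)
print(result_capture)
print(result_unchanged == unchanged)
```

Both forms give the same result here; the lookahead version is simply shorter because nothing has to be put back.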

answered Oct 16, 2011 at 0:00

Peter Graham


2 Comments

foosion Over a year ago

That does not seem to find the substring. p.match(data) returns nothing

Peter Graham Over a year ago

@foosion: match searches from the start of the string. Try p.search(data).
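To illustrate the difference mentioned in the comment above, here is a small sketch using an assumed sample string: match only succeeds when the pattern matches at index 0, while search scans the whole string for the first occurrence.

```python
import re

# Sample input (assumed from the question); the markup starts at index 3.
data = "the</span></p>\r\n<p><span class=font7>currency"
p = re.compile(r"</span></p>\r\n<p><span class=font\d>(?=[a-z])")

at_start = p.match(data)    # None: the string starts with "the", not "</span>"
anywhere = p.search(data)   # match object for the first occurrence

print(at_start)
print(anywhere.start())
```

Note that sub does not have this restriction: it replaces every occurrence wherever it appears, so a failing match call does not mean sub would fail.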

heltonbiker · Accepted Answer · 2011-10-16 00:03:25Z

1

I think you should use the flag re.DOTALL, which makes '.' match any character, including a newline (by default '.' does not match '\n').

So, the first line of your code would become:

p = re.compile(r'</span></p>..<p><span class=font\d>([a-z])', re.DOTALL)

(note the two unescaped dots instead of the line break).

Actually, there is also re.MULTILINE; every time I have a problem like this, one of those ends up solving it.
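A rough sketch of what the flag changes, using an assumed sample string. Without re.DOTALL the dot matches '\r' but not '\n', so two dots cannot span a CRLF. re.MULTILINE only changes how '^' and '$' behave, so it has no effect on a pattern like this one that contains neither anchor.

```python
import re

# Sample input with a CRLF between the paragraphs (assumed from the question).
data = "the</span></p>\r\n<p><span class=font7>currency"

pattern = r"</span></p>..<p><span class=font\d>([a-z])"

# Default behaviour: '.' matches anything except '\n', so '..' fails on "\r\n".
plain_hit = re.search(pattern, data)

# With DOTALL, '.' also matches '\n', so '..' covers the CRLF.
dotall_hit = re.search(pattern, data, re.DOTALL)

# The replacement is a raw string so that \1 is a backreference to group 1.
result = re.sub(pattern, r" \1", data, flags=re.DOTALL)

print(plain_hit)
print(dotall_hit is not None)
print(result)
```

The dots will match any two characters in that position, so the explicit \r\n form is stricter when the line ending is known.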

Hope it helps.

answered Oct 16, 2011 at 0:03

heltonbiker


1 Comment

foosion Over a year ago

That does not seem to find the substring. p.match(data) returns nothing.

FailedDev · Accepted Answer · 2011-10-16 00:25:28Z

1

This :

result = re.sub("(?si)(.*?)</?[A-Z][A-Z0-9]*[^>]*>.*</?[A-Z][A-Z0-9]*[^>]*>(.*)", r"\1 \2", subject)

Applied to :

the</span></p>
<p><span class=font7>currency

Produces :

the currency

That said, I would strongly advise against using regex with XML/HTML/XHTML. This generic regex removes all elements and captures any text before and after them in groups 1 and 2.
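A short sketch of how this generic pattern behaves, so its trade-offs are visible. The second and third inputs are hypothetical examples, not taken from the question. Because the middle .* is greedy, everything between the first and the last tag is dropped, including any text there, and the pattern does not check whether the second word starts with a lowercase letter.

```python
import re

pattern = r"(?si)(.*?)</?[A-Z][A-Z0-9]*[^>]*>.*</?[A-Z][A-Z0-9]*[^>]*>(.*)"

# Input from the question.
subject = "the</span></p>\r\n<p><span class=font7>currency"
joined = re.sub(pattern, r"\1 \2", subject)

# Hypothetical input with text between the first and last tag:
# the word "bold" is removed along with the tags.
nested = "a<b>bold</b>c"
collapsed = re.sub(pattern, r"\1 \2", nested)

# Hypothetical input where the second word is capitalised:
# it is joined anyway, unlike the lookahead approach.
capitalised = "The</span></p>\r\n<p><span class=font3>Currency"
joined_anyway = re.sub(pattern, r"\1 \2", capitalised)

print(joined)
print(collapsed)
print(joined_anyway)
```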

edited Oct 16, 2011 at 0:25

answered Oct 16, 2011 at 0:02

FailedDev


2 Comments

foosion Over a year ago

The reason the html seems reversed is that you're seeing the end of one paragraph and the beginning of another. That regex seems very complex

FailedDev Over a year ago

@foosion Well this regex will work regardless of any elements you throw in it.
