Regex in Python

Question

So, I am trying to create a simple regex that matches the following string:

<PRE>><A HREF="../cgi-bin/hgTracks?hgsid=160564920&db=hg18&position=chrX:33267175-33267784&hgPcrResult=pack">chrX:33267175-33267784</A> 610bp TGATGTTTGGCGAGGAACTC GCAGAGTTTGAAGAGCTCGG
TGATGTTTGGCGAGGAACTCtactattgttacacttaggaaaataatcta
atccaaaggctttgcatctgtacagaagagcgagtagatactgaaagaga
tttgcagatccactgttttttaggcaggaagaatgctcgttaaatgcaaa
cgctgctctggctcatgtgtttgctccgaggtataggttttgttcgactg
acgtatcagatagtcagagtggttaccacaccgacgttgtagcagctgca
taataaatgactgaaagaatcatgttaggcatgcccacctaacctaactt
gaatcatgcgaaaggggagctgttggaattcaaatagactttctggttcc
cagcagtcggcagtaatagaatgctttcaggaagatgacagaatcaggag
aaagatgctgttttgcactatcttgatttgttacagcagccaacttattg
gcatgatggagtgacaggaaaaacagctggcatggaaggtaggattatta
aagctattacatcattacaaatacaattagaagctggccatgacaaagca
tatgtttgaacaagcagctgttggtagctggggtttgttgCCGAGCTCTT
CAAACTCTGC
</PRE>

I have created the following regex:

<PRE>[.|[\n]]*</PRE>

yet it won't match the string above. Does anyone have a solution to this conundrum, and perhaps an explanation of why this doesn't work?

Sorry about the formatting of this question.

Are you trying to just match that exact string type, or do you want to pull pieces of the string out?
– ABach, Commented Jun 2, 2010 at 19:05
You have newlines in your string, so don't you need some "match across multiple lines" flag?
– user354134, Commented Jun 2, 2010 at 19:06
What don't you understand? He just wants the string between and including the <PRE> tags.
– Alex Gordon, Commented Jun 2, 2010 at 19:07
Were you attempting to match the string with the <PRE> tag in it, or was that only meant to be used for formatting?
– John Rasch, Commented Jun 2, 2010 at 19:08
@every_answer: is there really a need to be so snarky? I was clarifying the OP's question; that doesn't make me an idiot.
– ABach, Commented Jun 2, 2010 at 19:11

Community · Accepted Answer · 2017-05-23 12:01:12Z

2

Stop trying to parse HTML using regexes. You can't do it (robustly). There's a reason this famous SO answer exists. Use lxml instead.
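
As a rough illustration of the parser-based approach, here is a minimal sketch. It deliberately swaps in the standard library's html.parser instead of lxml so that it runs without third-party packages; the class name PreTextExtractor and the shortened sample_html string are made up for this example and are not from the question.

```python
from html.parser import HTMLParser


class PreTextExtractor(HTMLParser):
    """Collects the text found inside <pre> elements (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self._depth = 0      # how many <pre> elements are currently open
        self._chunks = []    # text fragments seen while inside <pre>

    def handle_starttag(self, tag, attrs):
        # html.parser reports tag names in lowercase, so <PRE> arrives as "pre".
        if tag == "pre":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "pre" and self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth:
            self._chunks.append(data)

    def text(self):
        return "".join(self._chunks)


# Shortened, made-up stand-in for the markup in the question.
sample_html = '<PRE><A HREF="x?a=1">chrX:1-2</A> 610bp\nTGAT\ncaaa\n</PRE>'

extractor = PreTextExtractor()
extractor.feed(sample_html)
extractor.close()
pre_text = extractor.text()
```

Here pre_text holds the link text and the sequence lines with the tags stripped. With lxml the same idea would go through its own parsing API rather than a handler class.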

edited May 23, 2017 at 12:01

answered Jun 2, 2010 at 19:08

Hank Gay

Ofri Raviv · Accepted Answer · 2010-06-02 22:38:08Z

1

If you're going to parse HTML, please use lxml, as Hank proposed.

But for this regex to work, you need to change the [] to (), giving (.|\n)*. Inside square brackets, a | is interpreted as the literal character '|' and not as the OR operator.
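
A small sketch of that change, assuming the intended pattern is "any character or a newline, repeated". The sample text is invented for illustration, and the non-capturing variant is an optional extra, not something the answer requires.

```python
import re

# Hypothetical sample text; any multi-line string would do.
sample = "first line\nsecond line\n"

# Parentheses create a group, so '|' means OR:
# either any non-newline character, or a newline.
grouped = re.compile(r'(.|\n)*')
grouped_match = grouped.fullmatch(sample)

# Same alternation with a non-capturing group, for when the
# captured character itself is of no interest.
non_capturing = re.compile(r'(?:.|\n)*')
non_capturing_match = non_capturing.fullmatch(sample)
```

Both patterns consume the whole sample, newlines included.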

Another option is to use the flag that's called DOTALL, which makes the dot operator match anything, including a newline. This way the regex becomes very simple:

import re
# re.DOTALL lets '.' match newlines, so '.*' can span the whole block
m = re.match(r'<PRE>(.*)</PRE>', input_string, re.DOTALL)
m.group(1)

m.group(1) then returns the string inside the PRE element, without the <PRE> and </PRE> tags themselves.
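
To see what the flag changes, here is a minimal sketch that runs the same pattern with and without re.DOTALL. The input_string value is a shortened, made-up stand-in for the string in the question.

```python
import re

# Hypothetical, shortened stand-in for the string in the question.
input_string = "<PRE>header line\nacgt\nTTGA\n</PRE>"

# Without the flag, '.' stops at the first newline, so </PRE> is never reached.
without_flag = re.match(r'<PRE>(.*)</PRE>', input_string)

# With re.DOTALL, '.' also matches newlines and the pattern spans every line.
with_flag = re.match(r'<PRE>(.*)</PRE>', input_string, re.DOTALL)
inner = with_flag.group(1)
```

If the input may contain more than one PRE block, the non-greedy form .*? stops at the first closing tag instead of the last one.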

answered Jun 2, 2010 at 22:38

Ofri Raviv

Ethan Furman · Accepted Answer · 2011-09-07 12:53:00Z

0

The issue is that inside []'s the . is a period, not a match-anything dot; the | is a pipe, not an or; and the inner [ is a literal bracket, not the start of another character class (the first ] then closes the class, which leaves the second ] as a literal character) -- in other words, most of the non-backslash special symbols lose their specialness.
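
The following sketch shows how Python's re module reads the pattern from the question. The sample strings are invented, and the warning filter is only there because recent Python versions flag the unescaped [ inside a character class.

```python
import re
import warnings

# Hypothetical sample text; any ordinary multi-line string would do.
sample = "first line\nsecond line\n"

# Recent Python versions emit a FutureWarning ("possible nested set") for
# the unescaped '[' inside the class; silence it for this demonstration.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    original = re.compile(r'[.|[\n]]*')

# The pattern reads as: one character out of '.', '|', '[' or newline,
# followed by zero or more literal ']' characters.
matches_text = original.match(sample)        # 'f' is not in the class
matches_symbols = original.fullmatch("|]]")  # '|' followed by two ']'
```

So the pattern fails on ordinary text but happily matches an odd string such as |]].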

What you will want to do is this:

import re
# re.search scans the whole string; re.DOTALL lets '.' match newlines
m = re.search(r'(<PRE>.*</PRE>)', input_string, re.DOTALL)
m.group(1)

.search() will look everywhere in the string for the match (.match() only checks the beginning of the string), and re.DOTALL (or re.S) will have the . match newlines as well.

If you don't want the <PRE> and </PRE> tags included, move the parentheses to surround the .*.
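
A short sketch of the difference between .match() and .search(), and of the two parenthesis placements. The page string is a made-up input in which the PRE block is preceded by other text.

```python
import re

# Hypothetical input: the block is preceded and followed by other text.
page = "intro text\n<PRE>acgt\nTTGA\n</PRE>\ntrailer"

# .match() only tries at the start of the string, which is "intro text" here.
at_start = re.match(r'(<PRE>.*</PRE>)', page, re.S)

# .search() scans forward until the pattern fits somewhere.
anywhere = re.search(r'(<PRE>.*</PRE>)', page, re.S)
with_tags = anywhere.group(1)

# Moving the parentheses around '.*' captures only the content.
inner_only = re.search(r'<PRE>(.*)</PRE>', page, re.S).group(1)
```

re.S and re.DOTALL are two names for the same flag, so either spelling works.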

answered Sep 7, 2011 at 12:53

Ethan Furman