Python regex look-behind requires fixed-width pattern

Question

When trying to extract the title of an HTML page, I have always used the following regex:

(?<=<title.*>)([\s\S]*)(?=</title>)

This extracts everything between the tags in a document and ignores the tags themselves. However, when I try to use this regex in Python, it raises the following exception:

Traceback (most recent call last):
  File "test.py", line 21, in <module>
    pattern = re.compile('(?<=<title.*>)([\s\S]*)(?=</title>)')
  File "C:\Python31\lib\re.py", line 205, in compile
    return _compile(pattern, flags)
  File "C:\Python31\lib\re.py", line 273, in _compile
    p = sre_compile.compile(pattern, flags)
  File "C:\Python31\lib\sre_compile.py", line 495, in compile
    code = _code(p, flags)
  File "C:\Python31\lib\sre_compile.py", line 480, in _code
    _compile(code, p.data, flags)
  File "C:\Python31\lib\sre_compile.py", line 115, in _compile
    raise error("look-behind requires fixed-width pattern")
sre_constants.error: look-behind requires fixed-width pattern

The code I am using is:

pattern = re.compile('(?<=<title.*>)([\s\S]*)(?=</title>)')
m = pattern.search(f)

If I make a minimal adjustment, it works:

pattern = re.compile('(?<=<title>)([\s\S]*)(?=</title>)')
m = pattern.search(f)

This will not, however, match title tags that for some reason have attributes or similar.
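
A minimal sketch of the behaviour described above, assuming only the standard `re` module: the variable-width look-behind is rejected when the pattern is compiled, while the fixed-width one compiles but misses a title tag that carries attributes. The two sample HTML strings are made up for illustration and do not come from the question.

```python
import re

# Hypothetical sample inputs, not taken from the question.
plain = "<html><head><title>Plain</title></head></html>"
with_attr = '<html><head><title lang="en">With attribute</title></head></html>'

# Variable-width look-behind: rejected when the pattern is compiled.
try:
    re.compile(r"(?<=<title.*>)([\s\S]*)(?=</title>)")
    compiled = True
except re.error as exc:
    compiled = False
    message = str(exc)

# Fixed-width look-behind: compiles, but only matches a bare <title>.
fixed = re.compile(r"(?<=<title>)([\s\S]*)(?=</title>)")
plain_match = fixed.search(plain)
attr_match = fixed.search(with_attr)

print(compiled)               # whether the first pattern compiled
print(plain_match.group(1))   # text between the bare tags
print(attr_match)             # None when the opening tag has attributes
```

The look-behind has to know how many characters to step back, which is why `<title>` is accepted and `<title.*>` is not.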

Does anyone know a good workaround for this issue? Any tips are appreciated.

Is there some reason it has to be a zero-width assertion? Could you just use a non-capturing group?
– Marcelo Cantos, Commented Apr 10, 2010 at 11:47
You shouldn’t use regular expressions to process HTML in the first place. But why use look-arounds at all, rather than something like <title.*>([\s\S]*)</title>, and take the match of the first group?
– Gumbo, Commented Apr 10, 2010 at 11:49

Welbog · Accepted Answer · 2010-04-10 11:47:16Z

13

Toss out the idea of parsing HTML with regular expressions and use an actual HTML parsing library instead. After a quick search I found this one. It's a much safer way to extract information from an HTML file.

Remember, HTML is not a regular language, so regular expressions are fundamentally the wrong tool for extracting information from it.
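
As a rough illustration of the parser-based approach, here is a sketch that uses `html.parser` from Python's standard library. This is a stand-in and not necessarily the library the answer links to; the names `TitleParser` and `extract_title` and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lower-cased; attributes are simply ignored here.
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        # Text may arrive in several chunks, so collect and join later.
        if self._in_title:
            self._parts.append(data)

    @property
    def title(self):
        return "".join(self._parts) if self._done else None


def extract_title(html):
    """Return the text of the first <title>, or None if there is none."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title


sample = '<html><head><title id="t">My\nPage</title></head><body></body></html>'
print(extract_title(sample))
```

Because the parser tracks tags itself, attributes on the opening tag and line breaks inside the title need no special handling.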

answered Apr 10, 2010 at 11:47

Welbog


1 Comment

Matthew Flaschen Over a year ago

BeautifulSoup (crummy.com/software/BeautifulSoup) is also a good option.

Community · Accepted Answer · 2017-05-23 12:14:16Z

6

Here's a famous answer on parsing HTML with regular expressions that does a great job of saying, "don't use regex to parse HTML."

edited May 23, 2017 at 12:14

CommunityBot

answered Apr 10, 2010 at 13:01

Stephen Harmon


1 Comment

Cerin Over a year ago

Yes and no. You shouldn't use regex to parse an entire DOM, or complicated nestings of tags. However, parsing a single non-nested tag, as the OP is trying to do, is a perfectly legitimate use of regex.

Cerin · Accepted Answer · 2013-03-29 15:02:14Z

4

The regex for extracting the content of non-nested HTML/XML tags is actually very simple:

import re
r = re.compile(r'<title[^>]*>(.*?)</title>', re.DOTALL)  # DOTALL lets the title span several lines

However, for anything more complex, you should really use a proper HTML parser such as BeautifulSoup or the standard library's html.parser (urllib only fetches the page; it does not parse it).
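
A small sketch of how a capturing-group pattern like this might be used in practice. The `re.DOTALL` and `re.IGNORECASE` flags, the helper name `find_title`, and the sample page are assumptions added for illustration, not part of the answer.

```python
import re

# [^>]* allows attributes on the opening tag; (.*?) is non-greedy.
# DOTALL lets "." cross newlines; IGNORECASE accepts <TITLE> as well.
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.DOTALL | re.IGNORECASE)


def find_title(html):
    """Return the stripped title text, or None if no title tag is found."""
    match = TITLE_RE.search(html)
    return match.group(1).strip() if match else None


page = '<head><TITLE class="x">\n  Hello, world\n</TITLE></head>'
print(find_title(page))
```

Since the tags are matched outside the capturing group, no look-behind is needed and the fixed-width restriction never comes into play.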

answered Mar 29, 2013 at 15:02

Cerin

Vojta Rylko · Accepted Answer · 2010-04-10 17:22:53Z

3

What about something like:

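 import re
 r = re.compile(r"(<title[^>]*>)([\s\S]*?)(</title>)")  # [^>]* and *? stop at the first closing tag
 m = r.search(page)
 title = m.group(2) if m else None  # search() returns None when there is no title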

answered Apr 10, 2010 at 17:22

Vojta Rylko

ghostdog74 · Accepted Answer · 2010-04-10 13:04:30Z

2

If you just want to get the title tag,

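import urllib.request

# Python 3: urllib2 is now urllib.request, and read() returns bytes (UTF-8 is assumed here)
html = urllib.request.urlopen("http://somewhere").read().decode("utf-8", errors="replace")
for item in html.split("</title>"):
    if "<title>" in item:
        # 7 == len("<title>"), so the slice starts just after the opening tag
        print(item[item.find("<title>") + 7:])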

answered Apr 10, 2010 at 13:04

ghostdog74
