Why won't Python regex work on a formatted string of HTML?

Question

from bs4 import BeautifulSoup
import urllib
import re

soup = urllib.urlopen("http://atlanta.craigslist.org/cto/")
soup = BeautifulSoup(soup)
souped = soup.p
print souped
m = re.search("\\$.",souped)
print m.group(0)

I can download and print the HTML just fine, but the script always breaks when I add the last two lines.

I get this error:

Traceback (most recent call last):
  File "C:\Python27\Lib\site-packages\pythonwin\pywin\framework\scriptutils.py", line 323, in RunScript
    debugger.run(codeObject, __main__.__dict__, start_stepping=0)
  File "C:\Python27\Lib\site-packages\pythonwin\pywin\debugger\__init__.py", line 60, in run
    _GetCurrentDebugger().run(cmd, globals,locals, start_stepping)
  File "C:\Python27\Lib\site-packages\pythonwin\pywin\debugger\debugger.py", line 655, in run
    exec cmd in globals, locals
  File "C:\Users\Zack\Documents\Scripto.py", line 1, in <module>
    from bs4 import BeautifulSoup
  File "C:\Python27\lib\re.py", line 142, in search
    return _compile(pattern, flags).search(string)
TypeError: expected string or buffer

Thanks lots!

Roman Bodnarchuk · Accepted Answer · 2012-02-25 17:33:50Z

6

You probably want re.search("\\$.", str(souped)).

answered Feb 25, 2012 at 17:33

Roman Bodnarchuk


2 Comments

kindall Over a year ago

To expand on this, BeautifulSoup objects have a __str__() method to convert them to strings, so they can be printed nicely (because print will do that automatically), but they are not actually strings, and re.search() wants a string. Hence you must explicitly convert the HTML to a string so you can search it.
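To make that concrete, here is a minimal sketch that does not need BeautifulSoup at all. FakeTag is a hypothetical stand-in (not a bs4 class) for any object that defines __str__() but is not itself a string, and the markup in it is a shortened, made-up version of the listing:

```python
import re


class FakeTag(object):
    """Hypothetical stand-in for a bs4 Tag: printable, but not a string."""

    def __init__(self, markup):
        self.markup = markup

    def __str__(self):
        # print and str() both call this to get the text form.
        return self.markup


souped = FakeTag('<p class="row">2000 Lexus RX 300 - $7500</p>')

print(souped)  # works: printing converts the object via __str__()

try:
    re.search(r"\$.", souped)  # fails: re.search() needs an actual string
    raised = False
except TypeError:
    raised = True

# Converting explicitly gives re.search() the string it expects.
first_hit = re.search(r"\$.", str(souped)).group(0)
print(first_hit)
```

The same TypeError shows up with a real Tag for the same reason, which is why wrapping it in str() makes the search work.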

Bite code Over a year ago

+1, but I would use unicode(), not str, if possible. And add the re.U flag.
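One plausible reason for that advice, sketched under assumptions: scraped pages often contain non-ASCII characters such as the non-breaking space (\xa0), and character classes like \s only treat them as whitespace when the text is unicode and re.U is in effect. The sample text below is made up for illustration:

```python
import re

# Made-up text containing a non-breaking space (\xa0) before "img".
text = u"pic\xa0img $7500"

# With a unicode string and re.U, \s recognises the non-breaking space.
# (In Python 3, str patterns already behave this way by default.)
nbsp_match = re.search(r"\s", text, re.U)
nbsp_found = nbsp_match is not None and nbsp_match.group(0) == u"\xa0"
print(nbsp_found)
```

For a plain "\$." pattern the flag makes no difference; it only matters once the pattern uses classes such as \s, \w or \d.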

Zsolt Botykai · Accepted Answer · 2012-02-25 17:37:32Z

1

Because souped is an object, not a string; printing it only converts it to text for display. To use it as text in another context (as you do here), convert it first with str(souped), or with unicode(souped) if the content is Unicode.

answered Feb 25, 2012 at 17:37

Zsolt Botykai


jfs · Accepted Answer · 2012-02-25 17:54:59Z

You could pass a compiled regex as the search criterion to the .find() method:

>>> from bs4 import BeautifulSoup
>>> from urllib2 import urlopen # from urllib.request import urlopen
>>> import re
>>> page = urlopen("http://atlanta.craigslist.org/cto/")
>>> soup = BeautifulSoup(page)
>>> soup.find('p', text=re.compile(r"\$."))
' -\n\t\t\t $7500'

soup.p returns a Tag object. You could use str() or unicode() to convert it to a string:

>>> p = soup.p
>>> str(p)
'<p class="row">\n<span class="ih" id="images:5Nb5I85J83N73p33H6
c2pd3447d5bff6d1757.jpg">\xa0</span>\n<a href="http://atlanta.cr
aigslist.org/nat/cto/2870295634.html">2000 Lexus RX 300</a> -\n\
t\t\t $7500<font size="-1"> (Buford)</font> <span class="p"> pic
\xa0img</span><br class="c" />\n</p>'
>>> re.search(r"\$.", str(p)).group(0)
'$7'
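Note that "\$." matches the dollar sign plus exactly one more character, which is why the result is '$7' rather than the whole price. If the goal is the full amount (an assumption, the question does not say), a pattern along these lines might be closer. The html string is a shortened copy of the markup printed above, and the pattern is only a sketch that expects digits and optional commas:

```python
import re

# Shortened copy of the <p> markup shown in the session above.
html = ('<a href="http://atlanta.craigslist.org/nat/cto/2870295634.html">'
        '2000 Lexus RX 300</a> -\n\t\t\t $7500<font size="-1"> (Buford)</font>')

# '$' followed by exactly one character.
one_char = re.search(r"\$.", html).group(0)

# '$' followed by a digit and then any run of digits or commas.
full_price = re.search(r"\$\d[\d,]*", html).group(0)

print(one_char)
print(full_price)
```

This would also pick up prices written like $7,500, but not decimals or other currencies.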
