from bs4 import BeautifulSoup
import urllib
import re

soup = urllib.urlopen("http://atlanta.craigslist.org/cto/")
soup = BeautifulSoup(soup)
souped = soup.p
print souped
m = re.search("\\$.",souped)
print m.group(0)

I can download and print the HTML just fine, but the script always breaks when I add the last two lines.

I get this error:

Traceback (most recent call last):
  File "C:\Python27\Lib\site-packages\pythonwin\pywin\framework\scriptutils.py", line 323, in RunScript
    debugger.run(codeObject, __main__.__dict__, start_stepping=0)
  File "C:\Python27\Lib\site-packages\pythonwin\pywin\debugger\__init__.py", line 60, in run
    _GetCurrentDebugger().run(cmd, globals,locals, start_stepping)
  File "C:\Python27\Lib\site-packages\pythonwin\pywin\debugger\debugger.py", line 655, in run
    exec cmd in globals, locals
  File "C:\Users\Zack\Documents\Scripto.py", line 1, in <module>
    from bs4 import BeautifulSoup
  File "C:\Python27\lib\re.py", line 142, in search
    return _compile(pattern, flags).search(string)
TypeError: expected string or buffer

Thanks lots!

3 Answers

6

You probably want re.search("\\$.", str(souped)).


2 Comments

To expand on this, BeautifulSoup objects have a __str__() method to convert them to strings, so they can be printed nicely (because print will do that automatically), but they are not actually strings, and re.search() wants a string. Hence you must explicitly convert the HTML to a string so you can search it.
+1, but I would use unicode(), not str, if possible. And add the re.U flag.
1

souped is a Tag object, not a string; printing it only converts it to text implicitly. To use it as text in another context, as you do with re.search(), convert it first with str(souped), or with unicode(souped) if it contains Unicode text.
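
To make the difference concrete, here is a minimal sketch that needs neither BeautifulSoup nor a network connection. FakeTag is a hypothetical stand-in I made up for illustration (it is not part of bs4); it only mimics the one relevant property of a Tag, namely that it defines __str__() without being a string. The sample markup is shortened and invented too.

```python
import re


class FakeTag(object):
    """Hypothetical stand-in for a bs4 Tag: printable, but not a string."""

    def __init__(self, markup):
        self.markup = markup

    def __str__(self):
        return self.markup


souped = FakeTag('<p class="row">2000 Lexus RX 300 - $7500</p>')

# print works because it calls str() on the object behind the scenes.
print(souped)

# re.search() does no such conversion, so it rejects the object.
try:
    re.search(r"\$.", souped)
    raised = False
except TypeError:
    raised = True

# Converting explicitly gives re.search() the string it expects.
match = re.search(r"\$.", str(souped))
first_two = match.group(0)
print(first_two)
```

The same explicit conversion should apply to a real Tag, since the TypeError comes from re.search() and not from BeautifulSoup.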


1

You could pass a regex as the search criterion to the .find() method:

>>> from bs4 import BeautifulSoup
>>> from urllib2 import urlopen # from urllib.request import urlopen
>>> import re
>>> page = urlopen("http://atlanta.craigslist.org/cto/")
>>> soup = BeautifulSoup(page)
>>> soup.find('p', text=re.compile(r"\$."))
' -\n\t\t\t $7500'

soup.p returns a Tag object. You could use str() or unicode() to convert it to a string:

>>> p = soup.p
>>> str(p)
'<p class="row">\n<span class="ih" id="images:5Nb5I85J83N73p33H6
c2pd3447d5bff6d1757.jpg">\xa0</span>\n<a href="http://atlanta.cr
aigslist.org/nat/cto/2870295634.html">2000 Lexus RX 300</a> -\n\
t\t\t $7500<font size="-1"> (Buford)</font> <span class="p"> pic
\xa0img</span><br class="c" />\n</p>'
>>> re.search(r"\$.", str(p)).group(0)
'$7'
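
Note that the pattern "\$." matches the dollar sign plus exactly one more character, which is why the result is '$7' and not the whole price. If the full price is what you are after (an assumption on my part, the question does not say), a slightly longer pattern might do. The sketch below runs on the text returned by .find() above; PRICE_RE is just a name I picked, and the pattern assumes prices are written as digits with optional thousands separators.

```python
import re

# "$", then digits, then optional groups like ",500" (assumed price format).
PRICE_RE = re.compile(r"\$\d+(?:,\d{3})*")

text = ' -\n\t\t\t $7500'  # the string returned by .find() above

match = PRICE_RE.search(text)
price = match.group(0) if match else None
print(price)

# The same pattern also copes with a thousands separator.
with_comma = PRICE_RE.search("asking $7,500 obo").group(0)
print(with_comma)
```

Checking for a failed match before calling .group(0) avoids an AttributeError on listings that have no price.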
