2

I am trying to scrape a webpage in Python. I can easily get results for tags that sit on a single line, but for tags spread over multiple lines my code retrieves nothing.

In the HTML source single line tags are present as:

<td><span class="facultyName">John Matthew Falletta, MD</span>

and multiple line tags are present as:

<td><span class="label">Division:</span>
            &nbsp;&nbsp;
                  </td><td>Hematology/Oncology</td>

Here is what I wrote:

patFinderFullname = re.compile('<span class="facultyName">(.*)</span>')

fullname = re.findall(patFinderFullname,webpage)         #works fine

patFinderDivision = re.compile('<span class="label">Division:</span>&nbsp;&nbsp;</td><td>(.*)</td>')

division = re.findall(patFinderDivision,webpage)       #doesn't work

Here, my webpage variable contains the url which has to be scraped. Can someone point out what I am missing or where I went wrong?

5 Comments

  • Don't post an image of text. Post the text so someone can cut/paste if they want to work with it. Commented Feb 15, 2013 at 5:02
  • My text contains HTML tags, and the editor automatically formats it during posting. Commented Feb 15, 2013 at 5:05
  • The source of your post contains a .jpg. Commented Feb 15, 2013 at 5:06
  • It was being formatted, that's why I posted a jpg, so that it remains exactly what I wrote. Commented Feb 15, 2013 at 5:07
  • Use Ctrl-K to mark the text as code and it won't format. Commented Feb 15, 2013 at 5:08

2 Answers

5

I highly recommend you use BeautifulSoup. It is a Python library for parsing HTML documents.
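
As a rough sketch of what that could look like for the Division field from the question: the helper name `extract_field` and the surrounding `<table><tr>` wrapper are my own assumptions for illustration, not something from the original page. It assumes the `beautifulsoup4` package is installed.

```python
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

# Sample markup copied from the question, wrapped in a table row (assumed).
html = """
<table><tr>
<td><span class="label">Division:</span>
            &nbsp;&nbsp;
                  </td><td>Hematology/Oncology</td>
</tr></table>
"""

def extract_field(page, label_text):
    """Return the text of the cell that follows the given label, or None.

    Hypothetical helper: finds the <span class="label"> whose text equals
    label_text, then reads the next <td> sibling of its enclosing cell.
    """
    soup = BeautifulSoup(page, "html.parser")
    for span in soup.find_all("span", class_="label"):
        if span.get_text(strip=True) == label_text:
            cell = span.find_parent("td")
            value = cell.find_next_sibling("td") if cell else None
            return value.get_text(strip=True) if value else None
    return None

division = extract_field(html, "Division:")
print(division)
```

Because the parser works on the document tree rather than on raw text, line breaks and indentation between the tags do not matter here.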

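P.S.: If you want to stick with your own code, use \s* to skip whitespace in the regex. Your pattern fails because the source has newlines and spaces between </span>, &nbsp;&nbsp; and </td>, and the pattern does not allow for them.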

patFinderDivision = re.compile(r'<span class="label">Division:</span>\s*&nbsp;&nbsp;\s*</td><td>(.*)</td>')
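
To see the difference, here is a small self-contained check against the snippet from the question. The names `strict` and `tolerant` are just labels I picked for the two patterns, and `webpage` is assumed to hold the HTML text itself (not the URL).

```python
import re

# HTML text as shown in the question, including the line breaks and indentation.
webpage = '''<td><span class="label">Division:</span>
            &nbsp;&nbsp;
                  </td><td>Hematology/Oncology</td>'''

# Original pattern: expects the tags to follow each other with no whitespace.
strict = re.compile(r'<span class="label">Division:</span>&nbsp;&nbsp;</td><td>(.*)</td>')

# Same pattern with \s* wherever the source may contain spaces or newlines.
tolerant = re.compile(r'<span class="label">Division:</span>\s*&nbsp;&nbsp;\s*</td><td>(.*)</td>')

strict_result = strict.findall(webpage)
tolerant_result = tolerant.findall(webpage)

print(strict_result)    # []
print(tolerant_result)  # ['Hematology/Oncology']
```

The raw-string prefix (r'...') keeps Python from trying to interpret \s as a string escape.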

5 Comments

Can you suggest how to read text from multi-line tags? In my example, I want "Hematology/Oncology" to be extracted into the variable division.
Try \s+ to skip the whitespace in the regex.
I got it working for division, but now the Address field is baffling me: <td><span class="label">Address:</span></td><td>Box 2991<br>DUMC<br>Durham, NC &nbsp;27710 </td> Is this the correct regex for it? patFinderAddress = re.compile(' <span class="label">Address:</span>\s+(.*)\s+</td>')
Posting code in a comment is not a good idea: I cannot see the newlines, so I cannot tell you where to put the "\s*". The address case is a bit different because you want to extract multiple chunks of data. You may want to look into re.search and group.
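
For the address case mentioned in the comments above, one possible approach with re.search and group is sketched below. This is only an illustration against the one-line snippet from the comment; the real page may contain line breaks that are not visible there, which is why the pattern allows whitespace between the tags and uses re.DOTALL. The variable names are my own.

```python
import re

# Address markup as pasted in the comment above (on a single line).
html = ('<td><span class="label">Address:</span></td>'
        '<td>Box 2991<br>DUMC<br>Durham, NC &nbsp;27710 </td>')

# Capture everything inside the cell that follows the Address label.
# re.DOTALL lets "." match newlines in case the cell spans several lines.
pattern = re.compile(
    r'<span class="label">Address:</span>\s*</td>\s*<td>(.*?)</td>',
    re.DOTALL,
)

address_lines = []
match = pattern.search(html)
if match:
    raw = match.group(1).replace('&nbsp;', ' ')
    # Split the cell on <br> tags and collapse runs of whitespace in each part.
    parts = [' '.join(part.split()) for part in re.split(r'<br\s*/?>', raw)]
    address_lines = [part for part in parts if part]

print(address_lines)  # ['Box 2991', 'DUMC', 'Durham, NC 27710']
```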
1

Just to add a sample of the kind of regexp you'd need to pull out the division:

re.compile(r'<span class="label">Division:</span>\s*&nbsp;&nbsp;\s*</td><td>(.*?)</td>')
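
The non-greedy (.*?) matters when more than one cell ends up on the same line. A quick sketch, using a made-up row (the extra "Pediatrics" cell is hypothetical and not from the original page):

```python
import re

# Hypothetical row with a second cell on the same line as the division.
row = ('<td><span class="label">Division:</span>&nbsp;&nbsp;</td>'
       '<td>Hematology/Oncology</td><td>Pediatrics</td>')

greedy = re.compile(r'<span class="label">Division:</span>\s*&nbsp;&nbsp;\s*</td><td>(.*)</td>')
lazy = re.compile(r'<span class="label">Division:</span>\s*&nbsp;&nbsp;\s*</td><td>(.*?)</td>')

greedy_result = greedy.findall(row)
lazy_result = lazy.findall(row)

print(greedy_result)  # ['Hematology/Oncology</td><td>Pediatrics']
print(lazy_result)    # ['Hematology/Oncology']
```

The greedy version runs to the last </td> on the line, while the lazy version stops at the first one.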

3 Comments

<td><span class="label">Phone:</span></td><td> (919) 668-5111<br> </td> Can you help me with the regex for this? Is this correct: patFinderPhone = re.compile('<span class="label">Phone:</span></td><td>\s+(.*)\s+<br>')
You're mostly there, except you're using \s+ after the number, which requires at least one whitespace character. When you're not sure whether whitespace is present, use \s*: re.compile(r'<span class="label">Phone:</span></td><td>\s*(.*?)\s*<br>')
Still not working. Please see this question: stackoverflow.com/questions/14889996/…
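
For reference, the two phone patterns from the comments above can be compared against the one-line snippet as pasted. This only covers that snippet; if the real page has line breaks between the tags (not visible in a comment), the pattern would need \s* in those places too.

```python
import re

# Phone markup as pasted in the comment above (on a single line).
html = '<td><span class="label">Phone:</span></td><td> (919) 668-5111<br> </td>'

# Asker's attempt: \s+ demands whitespace right before <br>, and there is none.
attempt = re.compile(r'<span class="label">Phone:</span></td><td>\s+(.*)\s+<br>')

# Suggested fix: \s* makes the whitespace optional, and .*? stops at the first <br>.
fixed = re.compile(r'<span class="label">Phone:</span></td><td>\s*(.*?)\s*<br>')

attempt_result = attempt.findall(html)
fixed_result = fixed.findall(html)

print(attempt_result)  # []
print(fixed_result)    # ['(919) 668-5111']
```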
