How to scrape html tags spread over multiple lines in python?

Question

I am trying to scrape a webpage in python. I was able to easily get the results for tags which were on a single line, but for tags spread over multiple lines, my code cannot retrieve anything.

In the HTML source single line tags are present as:

<td><span class="facultyName">John Matthew Falletta, MD</span>

and multiple line tags are present as:

<td><span class="label">Division:</span>
            &nbsp;&nbsp;
                  </td><td>Hematology/Oncology</td>

Here is what I wrote:

import re

patFinderFullname = re.compile(r'<span class="facultyName">(.*)</span>')

fullname = re.findall(patFinderFullname, webpage)  # works fine

patFinderDivision = re.compile(r'<span class="label">Division:</span>&nbsp;&nbsp;</td><td>(.*)</td>')

division = re.findall(patFinderDivision, webpage)  # doesn't work

Here, my webpage variable contains the URL which has to be scraped. Can someone point out what I am missing or where I am wrong?

Don't post an image of text. Post the text so someone can cut/paste if they want to work with it.
– Mark Tolonen, Commented Feb 15, 2013 at 5:02
My text contains HTML tags; it is automatically being formatted by the editor during posting.
– TheRock, Commented Feb 15, 2013 at 5:05
It was being formatted, that's why I posted a jpg, so that it remains exactly what I wrote.
– TheRock, Commented Feb 15, 2013 at 5:07

jurgenreza · Accepted Answer · 2013-02-15 06:07:34Z

5

I highly recommend you use BeautifulSoup. It is a Python library for parsing HTML documents.
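To illustrate why a parser copes with tags spread over several lines, here is a minimal sketch. It deliberately uses the standard library's html.parser rather than BeautifulSoup, so that it runs without installing anything; the class name CellText and the "value follows its label cell" lookup are assumptions made for this example, not part of the original page or of BeautifulSoup's API.

```python
from html.parser import HTMLParser


class CellText(HTMLParser):
    """Hypothetical helper: collect the stripped text of every <td> cell."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_td = False
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._in_td = True
            self._buf = []

    def handle_endtag(self, tag):
        if tag == "td" and self._in_td:
            self._in_td = False
            # strip() also removes the non-breaking spaces decoded from &nbsp;
            self.cells.append("".join(self._buf).strip())

    def handle_data(self, data):
        if self._in_td:
            self._buf.append(data)


html = """<td><span class="label">Division:</span>
            &nbsp;&nbsp;
                  </td><td>Hematology/Oncology</td>"""

parser = CellText()
parser.feed(html)
parser.close()

cells = parser.cells
# Assumption: the value sits in the cell right after its label cell.
division = cells[cells.index("Division:") + 1]
print(division)
```

Because the parser works on tags and text rather than on raw characters, the line breaks and indentation inside the first cell do not matter.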

P.S.: If you want to stick with your own code, use \s* to skip whitespace (including newlines) in the regex.

patFinderDivision = re.compile(r'<span class="label">Division:</span>\s*&nbsp;&nbsp;\s*</td><td>(.*)</td>')
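As a quick check of the difference, the sketch below runs both patterns against the multi-line fragment from the question. The variable names strict and tolerant are made up for this example, and webpage is assumed to hold the page's HTML text.

```python
import re

# The multi-line fragment from the question, newlines and indentation included.
webpage = (
    '<td><span class="label">Division:</span>\n'
    '            &nbsp;&nbsp;\n'
    '                  </td><td>Hematology/Oncology</td>'
)

# Original pattern: expects the pieces to be directly adjacent.
strict = re.compile(
    r'<span class="label">Division:</span>&nbsp;&nbsp;</td><td>(.*)</td>'
)
# Pattern with \s*: tolerates any whitespace, including newlines, between them.
tolerant = re.compile(
    r'<span class="label">Division:</span>\s*&nbsp;&nbsp;\s*</td><td>(.*)</td>'
)

strict_hits = strict.findall(webpage)
tolerant_hits = tolerant.findall(webpage)
print(strict_hits)    # []
print(tolerant_hits)  # ['Hematology/Oncology']
```

The original pattern fails only because of the newline and indentation around &nbsp;&nbsp;, which \s* absorbs.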


5 Comments

TheRock Over a year ago

Can you suggest how to read text from multi-line tags? In my example, I want "Hematology/Oncology" to be extracted into the variable division.

jurgenreza Over a year ago

Try \s+ to skip the whitespace in the regex.

TheRock Over a year ago

I got it right for division, but now the Address field is baffling me: <td>Address:</td><td>Box 2991 DUMC Durham, NC  27710 </td>. Is this the correct regex for it? patFinderAddress = re.compile(' Address:\s+(.*)\s+</td>')

jurgenreza Over a year ago

Posting code in a comment is not a good idea: I cannot see the newlines this way, so I cannot tell you where to put the "\s*". The address case is a bit different, as you want to extract multiple chunks of data. You may want to look into re.search and group.
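A rough sketch of what re.search with groups could look like for the address cell quoted above. The split into street, city, state and zip, the single-word city, the two-letter state and the five-digit zip code are all assumptions made for this example; the real page may need a different pattern.

```python
import re

# The address cell as quoted in the comment above.
cell = '<td>Address:</td><td>Box 2991 DUMC Durham, NC  27710 </td>'

# Hypothetical pattern: one capture group per chunk of the address.
pat_address = re.compile(
    r'Address:</td><td>\s*(.*?)\s+(\w+),\s*([A-Z]{2})\s+(\d{5})\s*</td>'
)

match = pat_address.search(cell)
if match:
    street, city, state, zip_code = match.groups()
    print(street)       # Box 2991 DUMC
    print(city)         # Durham
    print(state)        # NC
    print(zip_code)     # 27710
```

Unlike re.findall, re.search returns a match object (or None), so each chunk can be read with group(n) or all at once with groups().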

TheRock Over a year ago

Please see this stackoverflow.com/questions/14889996/….

Ali-Akber Saifee · Accepted Answer · 2013-02-15 05:23:38Z

1

Just to add a sample of the kind of regexp you'd need to pull out the division:

re.compile(r'<span class="label">Division:</span>\s*&nbsp;&nbsp;\s*</td><td>(.*?)</td>')
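The non-greedy (.*?) matters as soon as several cells share one line. The row below is a made-up single-line example assembled from the fields mentioned in this thread, not a copy of the real page:

```python
import re

# Hypothetical row with several cells on one line.
row = (
    '<td>Division:</td><td>Hematology/Oncology</td>'
    '<td>Phone:</td><td>(919) 668-5111</td>'
)

# Greedy: .* runs to the LAST </td> on the line.
greedy = re.findall(r'Division:</td><td>(.*)</td>', row)
# Non-greedy: .*? stops at the FIRST </td> after the value.
lazy = re.findall(r'Division:</td><td>(.*?)</td>', row)

print(greedy)  # ['Hematology/Oncology</td><td>Phone:</td><td>(919) 668-5111']
print(lazy)    # ['Hematology/Oncology']
```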


3 Comments

TheRock Over a year ago

<td>Phone:</td><td> (919) 668-5111 </td>. Can you help me with the regex for this? Is this correct: patFinderPhone = re.compile('Phone:</td><td>\s+(.*)\s+ ')

Ali-Akber Saifee Over a year ago

You're mostly there - except you're using \s+ after the number, which implies that there must be at least one whitespace character. When you're not sure whether whitespace is present, use \s*: re.compile('Phone:</td><td>\s*(.*?)\s* ')
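A small sketch of the \s+ versus \s* difference. Two things here are assumptions for the sake of the example: the "tight" cell without padding is invented, and both patterns end at the closing </td> instead of a literal space, because a lazy group followed by a literal space would stop at the first space inside the number.

```python
import re

padded = '<td>Phone:</td><td> (919) 668-5111 </td>'  # cell from the comment above
tight = '<td>Phone:</td><td>(919) 668-5111</td>'     # hypothetical: no padding

# \s+ requires at least one whitespace character on each side of the value.
plus = re.compile(r'Phone:</td><td>\s+(.*?)\s+</td>')
# \s* accepts zero or more, so both layouts match.
star = re.compile(r'Phone:</td><td>\s*(.*?)\s*</td>')

plus_padded = plus.findall(padded)
plus_tight = plus.findall(tight)
star_padded = star.findall(padded)
star_tight = star.findall(tight)

print(plus_padded)  # ['(919) 668-5111']
print(plus_tight)   # []
print(star_padded)  # ['(919) 668-5111']
print(star_tight)   # ['(919) 668-5111']
```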

TheRock Over a year ago

Still not coming. Please see this question stackoverflow.com/questions/14889996/… .
