Parsing an html page using beautifulsoup/python

Question

I am currently parsing an HTML page to extract some information.

Sometimes there is no text after a closing tag, as is the case for Ethos in the HTML document below:

<span id= "here" style>
  <br>
  <b> Post Primary</b>
  <b>school<b>
  <br>
  <b>Roll number: </b>
  "60000"
  <br>
  <b>Principal</b>
      "Paul Ince"
  <br>
  <b>Enrolment:</b>
  "Boys; 193 Girls: 190   (2012/13)"
  <br>
  <b>Ethos:</b>
  <b>Catchment:</b>
  "North Inner CIty "
  <br>
 <b>Fees:</b>
 " No "
</span>

I would like to extract the following information:

Enrolment= "Boys:193 Girls: 190 (2012/13)"

Ethos= ""

Fees="No"

Have you made an attempt? It's always useful to share those, and people appreciate your efforts too.
– Totem, Commented Apr 22, 2014 at 15:26
It's worth noting that this HTML is badly formatted, e.g. missing closing <b> tags, etc. Does the real HTML you are parsing look like this?
– Hooked, Commented Apr 22, 2014 at 15:27
Kindly show what you've tried. The way SO works is that we expect at least a bit of understanding. To be honest, the above is simple BUT can be a bit of an issue due to the placement of the tags. It might prove difficult to give you a solution you don't understand yourself. :)
– WGS, Commented Apr 22, 2014 at 15:28
These docs should help: crummy.com/software/BeautifulSoup/bs3/… the Parse Tree
– Totem, Commented Apr 22, 2014 at 15:28

alecxe · Accepted Answer · 2014-04-22 15:57:07Z

3

Here's exactly what you need.

The idea is to define a list of the keys/labels you are interested in, find all b elements, and check whether the text of each b element is in that list. If it is, print the text of the b element followed by its next sibling:

from bs4 import BeautifulSoup

data = """<span id= "here" style>
 <br>
 <b> Post Primary</b>
 <b>school<b>
 <br>
 <b>Roll number: </b>b>
 "60000"
 <br>
 <b>Principal</b>
 "Paul Ince"
 <br>
 <b>Enrolment:</b>
 "Boys; 123 Girls: 102   (2012/13)"
 <br>
 <b>Ethos:</b>
 "Catholic  &nbsp "
 <b>Catchment:</b>
 "North Inner CIty "
 <br>
 <b>Fees:</b>
 " No "
</span>"""

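# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup(data, "html.parser")

keys = ['Enrolment', 'Ethos', 'Fees']

# soup('b') is shorthand for soup.find_all('b').
for element in soup('b'):
    # Drop the trailing colon from the label before comparing.
    if element.text[:-1] in keys:
        print(element.text + element.next_sibling.strip())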

prints:

Enrolment:"Boys; 123 Girls: 102   (2012/13)"
Ethos:"Catholic  &nbsp "
Fees:" No "
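The loop above relies on the next sibling of each b element being a text node. As a minimal sketch of a more defensive variant (the helper name extract_fields and the shortened markup are made up for illustration, and the quote stripping is an assumption about the desired output), the sibling type can be checked first so that a label followed directly by another tag yields an empty string:

```python
from bs4 import BeautifulSoup, NavigableString

# Shortened copy of the question's markup; "Ethos" has no text after it.
html = """<span id="here">
<b>Enrolment:</b>
"Boys; 193 Girls: 190 (2012/13)"
<br>
<b>Ethos:</b><b>Catchment:</b>
"North Inner CIty "
<br>
<b>Fees:</b>
" No "
</span>"""


def extract_fields(markup, keys):
    """Map each label in `keys` to the text after its <b> tag ('' if none)."""
    soup = BeautifulSoup(markup, "html.parser")
    result = {}
    for b in soup.find_all("b"):
        label = b.get_text().strip().rstrip(":")
        if label not in keys:
            continue
        sibling = b.next_sibling
        # Only a text node counts as a value; a tag (or nothing) means "empty".
        value = sibling if isinstance(sibling, NavigableString) else ""
        # Remove surrounding whitespace and the literal double quotes.
        result[label] = value.strip().strip('"').strip()
    return result


fields = extract_fields(html, ["Enrolment", "Ethos", "Fees"])
print(fields)
```

Here Ethos maps to an empty string because its next sibling is the Catchment tag rather than text.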

Hope that helps.

edited Apr 22, 2014 at 15:57

answered Apr 22, 2014 at 15:35

alecxe


2 Comments

user3073498 Over a year ago

Thanks. This only works on the assumption that there is text after the b tag. I have scenarios where there is no text after the b tag, as my current example reflects.

alecxe Over a year ago

@user3073498 it actually works for your current example without any changes.

Hooked · Accepted Answer · 2014-04-22 15:29:58Z

2

Fixing the closing tags of the <b> elements, you can parse a document like this by noting that the text you are after follows a bolded tag.

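import bs4

# A is assumed to hold the HTML string from the question, with the <b> tags closed properly.
soup = bs4.BeautifulSoup(A, "html.parser")
data = {}

for item in soup.find_all("b"):
    # The value is the text node that follows the label's <b> element.
    next_item = item.next_sibling
    data[item.text.strip()] = next_item.string.strip()

print(data)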

Gives a dictionary where you can extract the elements you are looking for:

{u'Ethos:': u'"Catholic  &nbsp "', u'school': u'', u'Fees:': u'" No "', u'Post Primary': u'', u'Roll number:': u'"60000"', u'Catchment:': u'"North Inner CIty "', u'Enrolment:': u'"Boys; 123 Girls: 102   (2012/13)"', u'Principal': u'"Paul Ince"'}
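To get from a dictionary like the one printed above to the three values the question asks for, one possible sketch is below. The helper name pick_fields is hypothetical, the sample entries only imitate the shape of that output, and Ethos is deliberately emptied (an assumption) to mirror the no-text case from the question:

```python
def pick_fields(data, wanted):
    """Return {label: value} for the wanted labels, without colons or quotes."""
    normalised = {}
    for label, value in data.items():
        key = label.strip().rstrip(":")
        # Remove surrounding whitespace and the literal double quotes.
        normalised[key] = value.strip().strip('"').strip()
    # Labels that never appeared come back as empty strings.
    return {key: normalised.get(key, "") for key in wanted}


# A few entries in the shape of the printed dictionary; "Ethos:" is emptied
# here (an assumption) to mirror the question's no-text case.
data = {
    "Ethos:": "",
    "Fees:": '" No "',
    "Roll number:": '"60000"',
    "Enrolment:": '"Boys; 123 Girls: 102 (2012/13)"',
}

fields = pick_fields(data, ["Enrolment", "Ethos", "Fees"])
print(fields)
```

Looking the keys up through a normalised copy means the caller does not have to remember the trailing colons.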

answered Apr 22, 2014 at 15:29

Hooked


1 Comment

user3073498 Over a year ago

Sometimes there is no text after a closing tag.

carlosdc · Accepted Answer · 2014-04-22 15:31:42Z

Here's another option. Because the document has HTML issues, it seemed reasonable to me to ignore the markup and just use the text of the document (BeautifulSoup provides that too). You should determine whether the problems with the bold tags are yours or come from the original source.

from bs4 import BeautifulSoup

html = """
<span id= "here" style>
 <br>
  <b> Post Primary</b>
   <b>school<b>
    <br>
     <b>Roll number: </b>b>
    "60000"
<br>
<b>Principal</b>
        "Paul Ince"
        <br>
    <b>Enrolment:</b>
"Boys; 123 Girls: 102   (2012/13)"
<br>
        <b>Ethos:</b>
    "Catholic  &nbsp "
    <b>Catchment:</b>
        "North Inner CIty "
        <br>
        <b>Fees:</b>
            " No "
    </span>
"""

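soup = BeautifulSoup(html, "html.parser")
# Work on the plain text only, dropping blank and whitespace-only lines.
q = [item for item in soup.text.split('\n') if item.strip()]
d = {}
# Stop one line early so that q[i+1] always exists.
for i in range(len(q) - 1):
    if 'Enrolment' in q[i] or 'Ethos' in q[i] or 'Fees' in q[i]:
        # The value is taken from the line right after the label.
        d[q[i].strip()] = q[i+1].strip()

print(d)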
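One caveat with pairing each label with the following line: when a label has no value, the following line is the next label. A rough sketch of one way around that, assuming the full set of labels is known in advance (LABELS and pair_labels are illustrative names, and the sample text only approximates what soup.text would return for the question's markup):

```python
# Every label that can appear in the text (assumed to be known up front).
LABELS = ("Enrolment", "Ethos", "Catchment", "Fees")


def pair_labels(text, wanted):
    """Pair each wanted label with the next line, unless that line is a label."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    result = {}
    for i, line in enumerate(lines):
        key = line.rstrip(":")
        if key not in wanted:
            continue
        following = lines[i + 1] if i + 1 < len(lines) else ""
        # If the next line is itself a label, this key has no value.
        if following.rstrip(":") in LABELS:
            following = ""
        result[key] = following.strip('"').strip()
    return result


# Approximation of the extracted text, with no value after "Ethos:".
text = """
Enrolment:
"Boys; 193 Girls: 190 (2012/13)"
Ethos:
Catchment:
"North Inner CIty "
Fees:
" No "
"""

fields = pair_labels(text, ["Enrolment", "Ethos", "Fees"])
print(fields)
```

This keeps the text-only approach but avoids reporting "Catchment:" as the value of Ethos.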
