
I am currently parsing an HTML page to extract some information.

Sometimes there is no text after a closing tag, as in the case of Ethos in the HTML document below:

<span id= "here" style>
  <br>
  <b> Post Primary</b>
  <b>school<b>
  <br>
  <b>Roll number: </b>
  "60000"
  <br>
  <b>Principal</b>
      "Paul Ince"
  <br>
  <b>Enrolment:</b>
  "Boys; 193 Girls: 190   (2012/13)"
  <br>
  <b>Ethos:</b>
  <b>Catchment:</b>
  "North Inner CIty "
  <br>
 <b>Fees:</b>
 " No "
</span>

I would like to extract the following information:

Enrolment= "Boys:193 Girls: 190 (2012/13)"

Ethos= ""

Fees="No"

  • Have you made an attempt? It's always useful to share those, and people appreciate your efforts too. Commented Apr 22, 2014 at 15:26
  • It's worth noting that this HTML is badly formatted, e.g. missing closing <b> tags, etc. Does the real HTML you are parsing look like this? Commented Apr 22, 2014 at 15:27
  • Kindly show what you've tried. The way SO works is that we expect at least a bit of understanding. To be honest, the above is simple BUT can be a bit of an issue due to the placement of the tags. It might prove difficult to give you a solution you don't understand yourself. :) Commented Apr 22, 2014 at 15:28
  • these docs should help: crummy.com/software/BeautifulSoup/bs3/… the Parse Tree Commented Apr 22, 2014 at 15:28

3 Answers

3

Here's exactly what you need.

The idea is to define a list of keys/labels you are interested in, find all b elements, and check whether the text of each b element (minus the trailing colon) is in that list. If it is, print the text of the b element followed by its next sibling:

from bs4 import BeautifulSoup

data = """<span id= "here" style>
 <br>
 <b> Post Primary</b>
 <b>school<b>
 <br>
 <b>Roll number: </b>b>
 "60000"
 <br>
 <b>Principal</b>
 "Paul Ince"
 <br>
 <b>Enrolment:</b>
 "Boys; 123 Girls: 102   (2012/13)"
 <br>
 <b>Ethos:</b>
 "Catholic  &nbsp "
 <b>Catchment:</b>
 "North Inner CIty "
 <br>
 <b>Fees:</b>
 " No "
</span>"""

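# Name the parser explicitly; bs4 warns when it has to guess one
soup = BeautifulSoup(data, "html.parser")

keys = ['Enrolment', 'Ethos', 'Fees']

for element in soup('b'):
    # Drop the trailing colon before comparing against the wanted labels
    if element.text[:-1] in keys:
        print(element.text + element.next_sibling.strip())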

prints:

Enrolment:"Boys; 123 Girls: 102   (2012/13)"
Ethos:"Catholic  &nbsp "
Fees:" No "

Hope that helps.


2 Comments

Thanks. This only works based on the assumption that there is a text after the b tag. I have scenarios where there is no text after the b tag like my current example reflects.
@user3073498 it actually works for your current example without any changes.
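For what it's worth, here is a minimal sketch of that check, run against a trimmed copy of the HTML from the question (where nothing follows Ethos). The helper name extract_fields and the quote stripping are my own assumptions, not part of the answer above. It also treats a next sibling that is another tag, or missing entirely, as "no value" instead of calling strip() on it.

```python
from bs4 import BeautifulSoup, NavigableString

# Trimmed copy of the question's HTML: no text follows the Ethos label.
html = """<span id="here">
  <b>Enrolment:</b>
  "Boys; 193 Girls: 190   (2012/13)"
  <br>
  <b>Ethos:</b>
  <b>Catchment:</b>
  "North Inner CIty "
  <br>
  <b>Fees:</b>
  " No "
</span>"""


def extract_fields(html, keys):
    """Map each wanted label to the text right after its <b> tag ('' if none).

    extract_fields is a hypothetical helper written for this illustration.
    """
    soup = BeautifulSoup(html, "html.parser")
    fields = {}
    for b in soup.find_all("b"):
        label = b.get_text().strip().rstrip(":")
        if label not in keys:
            continue
        sibling = b.next_sibling
        # Only a text node counts as a value; a tag or None means "no value".
        text = sibling if isinstance(sibling, NavigableString) else ""
        # Assumption: surrounding whitespace and double quotes are not wanted.
        fields[label] = text.strip().strip('"').strip()
    return fields


fields = extract_fields(html, ["Enrolment", "Ethos", "Fees"])
for key, value in fields.items():
    print(key, "=", repr(value))
```

Between the Ethos and Catchment tags there is only whitespace, so the sibling is a whitespace-only text node and the value comes out as an empty string.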
2

Fixing the closing tags of the <b> elements, you can parse a document like this by noting that the text you are after follows a bolded tag.

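import bs4

# A is the HTML document as a string (with the <b> closing tags fixed)
soup = bs4.BeautifulSoup(A, "html.parser")
data = {}

for item in soup.find_all("b"):
    next_item = item.next_sibling
    # When no text follows the label, the sibling is missing or another tag
    is_text = isinstance(next_item, bs4.NavigableString)
    data[item.text.strip()] = next_item.strip() if is_text else ""

print(data)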

This gives a dictionary from which you can extract the elements you are looking for:

{u'Ethos:': u'"Catholic  &nbsp "', u'school': u'', u'Fees:': u'" No "', u'Post Primary': u'', u'Roll number:': u'"60000"', u'Catchment:': u'"North Inner CIty "', u'Enrolment:': u'"Boys; 123 Girls: 102   (2012/13)"', u'Principal': u'"Paul Ince"'}

1 Comment

Sometimes there is no text after a closing tag.
1

Here's another option. Since the document has HTML issues, it seemed reasonable to me to ignore the markup and just use the text of the document (BeautifulSoup provides that too). You should determine whether the problems with the bold tags are yours or come from the original source.

from bs4 import BeautifulSoup

html = """
<span id= "here" style>
 <br>
  <b> Post Primary</b>
   <b>school<b>
    <br>
     <b>Roll number: </b>b>
    "60000"
<br>
<b>Principal</b>
        "Paul Ince"
        <br>
    <b>Enrolment:</b>
"Boys; 123 Girls: 102   (2012/13)"
<br>
        <b>Ethos:</b>
    "Catholic  &nbsp "
    <b>Catchment:</b>
        "North Inner CIty "
        <br>
        <b>Fees:</b>
            " No "
    </span>
"""

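soup = BeautifulSoup(html, "html.parser")
# Keep only the lines that contain something other than whitespace
q = [item.strip() for item in soup.text.split('\n') if item.strip()]
d = {}
# Stop one short of the end so q[i + 1] always exists
for i, line in enumerate(q[:-1]):
    # A label line is followed directly by its value line
    if any(key in line for key in ('Enrolment', 'Ethos', 'Fees')):
        d[line] = q[i + 1]

print(d)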
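One caveat with the line-based approach: when a label has no value (as with Ethos in the question), the "next line" is simply the following label. A rough sketch of one way around that, assuming a line ending in a colon is always a label; pair_labels is a hypothetical helper, and the list of lines is typed in by hand to mirror the question's document rather than produced by BeautifulSoup:

```python
def pair_labels(lines, keys):
    """Pair each wanted label with the line after it, or '' if none follows.

    Assumption: any line ending with ':' is a label, never a value.
    """
    result = {}
    for i, line in enumerate(lines):
        label = line.strip().rstrip(":")
        if label not in keys:
            continue
        following = lines[i + 1].strip() if i + 1 < len(lines) else ""
        if following.endswith(":"):
            # The next line is another label, so this label has no value.
            result[label] = ""
        else:
            # Drop the surrounding double quotes and padding.
            result[label] = following.strip('" ')
    return result


# Non-empty text lines, in the order they appear in the question's document.
lines = [
    "Post Primary", "school",
    "Roll number:", '"60000"',
    "Principal", '"Paul Ince"',
    "Enrolment:", '"Boys; 193 Girls: 190   (2012/13)"',
    "Ethos:",
    "Catchment:", '"North Inner CIty "',
    "Fees:", '" No "',
]

pairs = pair_labels(lines, ["Enrolment", "Ethos", "Fees"])
print(pairs)
```

This breaks down if a value can itself end with a colon, so treat it as a starting point only.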
