
I have tried in anger to parse the following representative HTML extract, using BeautifulSoup and lxml:

[<p class="fullDetails">
<strong>Abacus Trust Company Limited</strong>
<br/>Sixty Circular Road

            <br/>DOUGLAS

            <br/>ISLE OF MAN
            <br/>IM1 1SA
            <br/>
<br/>Tel: 01624 689600
            <br/>Fax: 01624 689601
        <br/>
<br/>
<span class="displayBlock" id="ctl00_ctl00_bodycontent_MainContent_Email">E-mail:  </span>
<a href="mailto:[email protected]" id="ctl00_ctl00_bodycontent_MainContent_linkToEmail">[email protected]</a>
<br/>
<span id="ctl00_ctl00_bodycontent_MainContent_Web">Web: </span>
<a href="http://www.abacusiom.com" id="ctl00_ctl00_bodycontent_MainContent_linkToSite">http://www.abacusiom.com</a>
<br/>
<br/><b>Partners(s) - ICAS members only:</b> S H Fleming, M J MacBain
        </p>]

What I want to do:

  • Extract 'strong' text into company_name

  • Extract 'br' tags text into company_line_x

  • Extract 'MainContent_Email' text into company_email

  • Extract 'MainContent_Web' text into company_web

The problems I was having:

1) I could extract all the text by using .findAll(text=True), but there was a lot of padding in each line

2) Non-ASCII chars are sometimes returned, and this would cause the csv.writer to fail. I'm not 100% sure how to handle this correctly. (I previously just used unicodecsv.writer)
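For context on problem 1, the padding can be stripped per cell before writing. This is just an illustrative sketch with made-up rows, not the real scraped data:

```python
import csv
import io

# Illustrative rows with the kind of padding .findAll(text=True) returns
rows = [[u'Abacus Trust Company Limited', u'   DOUGLAS   ', u'IM1 1SA']]

buf = io.StringIO()
writer = csv.writer(buf)
for row in rows:
    # strip the surrounding whitespace on each cell before writing
    writer.writerow([cell.strip() for cell in row])
```

On problem 2: Python 3's built-in csv module writes unicode natively, so this issue only bites on Python 2, where unicodecsv (as used above) is the standard workaround.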

Any advice would be MUCH appreciated!

At the moment, my function just receives page data and isolates the 'p class'

def get_company_data(page_data):
    if not page_data:
        return None
    company_dets = page_data.findAll("p", {"class": "fullDetails"})
    print company_dets
    return company_dets
  • How do you get the page data in the first place? Commented Sep 2, 2014 at 12:01
  • Thanks for the reply. I pull the data using the Requests module and just pass the page data to this function Commented Sep 2, 2014 at 12:25
  • Ok, are you using response text or content attribute? Commented Sep 2, 2014 at 12:49
  • I got it working - I am using the text attribute, but was 'souping' it in a function which pulls the page data, so I just removed that step from your code - works perfectly! Thanks so much for that ;) Commented Sep 2, 2014 at 13:04
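Putting the comments together, the fetch-and-soup step looks roughly like this. In practice `page_text` would be `requests.get(url).text` (the decoded unicode body); a cut-down sample from the question is inlined here so the sketch is self-contained:

```python
from bs4 import BeautifulSoup

# In practice: page_text = requests.get(url).text
page_text = """<p class="fullDetails"><strong>Abacus Trust Company Limited</strong>
<br/>Sixty Circular Road</p>"""

def get_company_data(page_data):
    if not page_data:
        return None
    return page_data.findAll("p", {"class": "fullDetails"})

soup = BeautifulSoup(page_text, 'html.parser')
company_dets = get_company_data(soup)
```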

2 Answers


Here's a complete solution:

from bs4 import BeautifulSoup, NavigableString, Tag

data = """
your html here
"""

soup = BeautifulSoup(data)
p = soup.find('p', class_='fullDetails')

company_name = p.strong.text
company_lines = []
for element in p.strong.next_siblings:
    if isinstance(element, NavigableString):
        text = element.strip()
        if text:
            company_lines.append(text)

company_email = p.find('span', text=lambda x: x and x.startswith('E-mail:')).find_next_sibling('a').text
company_web = p.find('span', text=lambda x: x and x.startswith('Web:')).find_next_sibling('a').text

print company_name
print company_lines
print company_email, company_web

Prints:

Abacus Trust Company Limited
[u'Sixty Circular Road', u'DOUGLAS', u'ISLE OF MAN', u'IM1 1SA', u'Tel: 01624 689600', u'Fax: 01624 689601', u'S H Fleming, M J MacBain']
[email protected] http://www.abacusiom.com

Note that to get the company lines we have to iterate over the strong tag's next siblings and collect all of the text nodes. company_email and company_web are retrieved by their labels, in other words, by the text of the span tags that precede them.
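The sibling-iteration part in isolation, on a toy fragment, works like this:

```python
from bs4 import BeautifulSoup, NavigableString

# Toy fragment illustrating the same sibling walk as above
snippet = '<p><strong>Name</strong>line one<br/>  line two  <br/></p>'
p = BeautifulSoup(snippet, 'html.parser').p

# Keep only non-empty text nodes that follow the <strong> tag
lines = [s.strip() for s in p.strong.next_siblings
         if isinstance(s, NavigableString) and s.strip()]
```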




Same as you did for the p data, by using findall()

(I use lxml for the below sample codes)

To get company name:

company_name  = ''
for strg in root.findall('strong'):
    company_name = strg.text     # this will give you Abacus Trust Company Limited

To get company lines/details:

company_line_x = ''
lines = []
for b in root.findall('br'):
    if b.tail:
        addr_line = b.tail.strip()
        if addr_line:
            lines.append(addr_line)

company_line_x = ', '.join(lines) # this will give you Sixty Circular Road, DOUGLAS, ISLE OF MAN, IM1 1SA, Tel: 01624 689600, Fax: 01624 689601
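The email and web parts, which this answer omits (as noted in the comments below), can be pulled the same way: find each labelled span, then take the element right after it. lxml offers getnext() for that step; the sketch below uses the stdlib xml.etree equivalent (indexing into the children list) on a trimmed, well-formed fragment of the question's HTML so it is self-contained:

```python
import xml.etree.ElementTree as ET

# Trimmed fragment of the question's HTML
data = """<p class="fullDetails">
<span id="ctl00_ctl00_bodycontent_MainContent_Email">E-mail: </span>
<a href="mailto:[email protected]">[email protected]</a>
<br/>
<span id="ctl00_ctl00_bodycontent_MainContent_Web">Web: </span>
<a href="http://www.abacusiom.com">http://www.abacusiom.com</a>
</p>"""

root = ET.fromstring(data)
children = list(root)
company_email = company_web = None
for i, el in enumerate(children):
    eid = el.get('id') or ''
    if eid.endswith('MainContent_Email'):
        company_email = children[i + 1].text  # the <a> right after the label span
    elif eid.endswith('MainContent_Web'):
        company_web = children[i + 1].text
```

With lxml proper, `children[i + 1]` would simply be `el.getnext()`.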

4 Comments

The OP is using BeautifulSoup.
OP says using BeautifulSoup and lxml, so I based my suggestions on lxml. Anyway, the idea remains more or less the same.
You are right, misread this part. Note that you are missing email and web parts currently. Thanks.
You have the complete answer already and I'm already feeling lazy ;)
