How to extract the following HTML snippet with Python

Question

I have the following Python code:

def getAddress(text):
    text = re.sub('\t', '', text)
    text = re.sub('\n', '', text)
    blocks = re.findall('<div class="result-box" itemscope itemtype="http://schema.org/LocalBusiness">([a-zA-Z0-9 ",;:\.#&_=()\'<>/\\\t\n\-]*)</span>Follow company</span>', text)
    name = ''
    strasse = ''
    locality = ''
    plz = ''
    region = ''
    i = 0

    for block in blocks:
        names = re.findall('class="url">(.*)</a>', block)
        strassen = re.findall('<span itemprop="streetAddress">([a-zA-Z0-9 ,;:\.&#]*)</span>', block)
        localities = re.findall('<span itemprop="addressLocality">([a-zA-Z0-9 ,;:&]*)</span>', block)
        plzs = re.findall('<span itemprop="postalCode">([0-9]*)</span>', block)
        regions = re.findall('<span itemprop="addressRegion">([a-zA-Z]*)</span>', block)

        try:
            for name in names:
                name = str(name)
                name = re.sub('<[^<]+?>', '', name)
                break

            for strasse in strassen:
                strasse = str(strasse)
                strasse = re.sub('<[^<]+?>', '', strasse)
                break

            for locality in localities:
                locality = str(locality)
                locality = re.sub('<[^<]+?>', '', locality)
                break

            for plz in plzs:
                plz = str(plz)
                plz = re.sub('<[^<]+?>', '', plz)
                break

            for region in regions:
                region = str(region)
                region = re.sub('<[^<]+?>', '', region)
                break
        except:
            continue
        print i
        i = i + 1

        if plz == '':
            plz = getZipCode(strasse, locality, region)
        address = '"' + name + '"' + ';' + '"' + strasse + '";' + locality + ';' + str(plz) + ';' + region + '\n'

        #saveToCSV(address)

I want to extract the data from the HTML snippet below. The snippet is repeated several times on the page, and I want the function to return one entry per snippet. Instead, it returns a single entry containing both snippets. What do I have to change?

<div class="result-box" itemscope itemtype="http://schema.org/LocalBusiness">
        <div class="clear">
            <h2 itemprop="name"><a href="http://www.manta.com/c/mxlk5yt/belgium-jewelers-corp" class="url">Belgium Jewelers Corp</a></h2>           </div>
        <div itemprop="address" itemscope itemtype="http://schema.org/PostalAddress">               <span itemprop="addressLocality">Lawrenceville</span> <span itemprop="addressRegion">NJ</span>
        </div>          <a href="#" class="followCompany" data-emid="mxlk5yt" data-companyname="Belgium Jewelers Corp" data-location="ListingFollowButton" data-location-page="Megabrowse">
            <span class="followMsg"><span class="followIcon mrs"></span>Follow company</span>
            <span class="followingMsg"><span class="followIcon mrs"></span>Following</span>
            <span class="unfollowMsg"><span class="followIcon mrs"></span>Unfollow company</span>
        </a>            <p class="type">Jewelry Stores</p>      </div>
    </li>       <li>        <div class="icons">
        <ul>            </ul>
    </div>

Why not use an HTML parser to extract that information instead? Regular expressions are not the tool to use here.
– Martijn Pieters, Commented May 21, 2013 at 8:29

Martijn Pieters · Accepted Answer · 2013-05-21 08:37:53Z

4

Please put down that hammer; HTML is not a regular-expression shaped nail. Regular expressions that parse HTML get complicated fast and are very fragile: they break easily when the HTML changes even subtly.
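To see why the original function returns a single entry, here is a minimal sketch of the likely cause: the `*` quantifier is greedy, so a pattern that is allowed to match `<`, `>` and `/` runs from the first opening `div` to the last `Follow company`. The markup and patterns below are simplified stand-ins for illustration, not the exact ones from the question.

```python
import re

# Two simplified result boxes, standing in for the repeated snippet in the question.
html = (
    '<div class="result-box"><a class="url">First Corp</a>'
    '<span>Follow company</span></div>'
    '<div class="result-box"><a class="url">Second Corp</a>'
    '<span>Follow company</span></div>'
)

# Greedy: ".*" runs to the LAST "Follow company", so both boxes land in one match.
greedy = re.findall(r'<div class="result-box">(.*)Follow company', html)

# Non-greedy: ".*?" stops at the FIRST "Follow company" after each opening tag.
lazy = re.findall(r'<div class="result-box">(.*?)Follow company', html)

print(len(greedy))  # 1
print(len(lazy))    # 2
```

A non-greedy quantifier addresses this one symptom, but the pattern still depends on the exact markup, which is why a parser is the more robust choice.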

Use a proper HTML parser instead. BeautifulSoup would make your task trivial:

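from bs4 import BeautifulSoup

# `text` is the page HTML, as passed to getAddress() in the question.
# Naming the parser explicitly avoids bs4's "no parser specified" warning.
soup = BeautifulSoup(text, 'html.parser')
for block in soup.find_all('div', class_="result-box", itemtype="http://schema.org/LocalBusiness"):
    # One iteration per result box, so each company is handled separately
    print(block.find('a', class_='url').string)

    street = block.find('span', itemprop="streetAddress")
    if street:
        print(street.string)

    locality = block.find('span', itemprop="addressLocality")
    if locality:
        print(locality.string)

    # .. etc. ..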
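
As a rough sketch of how the same loop could replace `getAddress()` entirely, the version below collects one dict per result box and lets the `csv` module do the quoting. It assumes Python 3 with `bs4` installed; `extract_addresses`, `FIELDS`, and `SAMPLE` are names made up for this illustration, and the sample markup is shortened from the snippet in the question.

```python
import csv
import io

from bs4 import BeautifulSoup

# Sample markup shortened from the snippet in the question.
SAMPLE = """
<div class="result-box" itemscope itemtype="http://schema.org/LocalBusiness">
  <h2 itemprop="name"><a href="#" class="url">Belgium Jewelers Corp</a></h2>
  <div itemprop="address" itemscope itemtype="http://schema.org/PostalAddress">
    <span itemprop="addressLocality">Lawrenceville</span>
    <span itemprop="addressRegion">NJ</span>
  </div>
</div>
"""

# Hypothetical field list: the itemprop values the question's regexes look for.
FIELDS = ["streetAddress", "addressLocality", "postalCode", "addressRegion"]


def extract_addresses(html):
    """Return one dict per result box; missing fields become empty strings."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for block in soup.find_all("div", class_="result-box"):
        link = block.find("a", class_="url")
        row = {"name": link.get_text(strip=True) if link else ""}
        for field in FIELDS:
            span = block.find("span", itemprop=field)
            row[field] = span.get_text(strip=True) if span else ""
        rows.append(row)
    return rows


rows = extract_addresses(SAMPLE)

# Let the csv module handle quoting instead of concatenating strings by hand.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["name"] + FIELDS, delimiter=";")
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

Because every field is looked up inside its own `block`, a value missing from one result box cannot leak in from the previous one, which the string variables in the original loop would allow.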

edited May 21, 2013 at 8:37

answered May 21, 2013 at 8:32


1 Comment

infoBB Over a year ago

Many thanks for all your help. I now use Beautiful Soup 4. It made the task much easier and reduced my code a lot.

Glitch Desire · Answer · 2013-05-21 08:32:22Z

0

You should look at Python's HTMLParser module (see its documentation). Regular expressions are notoriously bad at parsing HTML.
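
A minimal sketch of the standard-library route, assuming Python 3, where the module lives at `html.parser` (in Python 2 it was named `HTMLParser`). `ItempropCollector` is a hypothetical class name for this illustration, and the sketch deliberately ignores nested spans.

```python
from html.parser import HTMLParser


class ItempropCollector(HTMLParser):
    """Collect the text of every <span itemprop="..."> as (itemprop, text) pairs."""

    def __init__(self):
        super().__init__()
        self.current = None  # itemprop of the span being read, if any
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            # attrs is a list of (name, value) pairs
            self.current = dict(attrs).get("itemprop")

    def handle_data(self, data):
        # Only keep non-blank text that sits inside a span with an itemprop
        if self.current and data.strip():
            self.found.append((self.current, data.strip()))

    def handle_endtag(self, tag):
        if tag == "span":
            self.current = None


parser = ItempropCollector()
parser.feed(
    '<span itemprop="addressLocality">Lawrenceville</span> '
    '<span itemprop="addressRegion">NJ</span>'
)
parser.close()
print(parser.found)
```

This event-driven style needs more bookkeeping than BeautifulSoup, but it has no third-party dependency.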

answered May 21, 2013 at 8:32
