
I have the following Python code:

def getAddress(text):
    text = re.sub('\t', '', text)
    text = re.sub('\n', '', text)
    blocks = re.findall('<div class="result-box" itemscope itemtype="http://schema.org/LocalBusiness">([a-zA-Z0-9 ",;:\.#&_=()\'<>/\\\t\n\-]*)</span>Follow company</span>', text)
    name = ''
    strasse = ''
    locality = ''
    plz = ''
    region = ''
    i = 0

    for block in blocks:
        names = re.findall('class="url">(.*)</a>', block)
        strassen = re.findall('<span itemprop="streetAddress">([a-zA-Z0-9 ,;:\.&#]*)</span>', block)
        localities = re.findall('<span itemprop="addressLocality">([a-zA-Z0-9 ,;:&]*)</span>', block)
        plzs = re.findall('<span itemprop="postalCode">([0-9]*)</span>', block)
        regions = re.findall('<span itemprop="addressRegion">([a-zA-Z]*)</span>', block)

        try:
            for name in names:
                name = str(name)
                name = re.sub('<[^<]+?>', '', name)
                break

            for strasse in strassen:
                strasse = str(strasse)
                strasse = re.sub('<[^<]+?>', '', strasse)
                break

            for locality in localities:
                locality = str(locality)
                locality = re.sub('<[^<]+?>', '', locality)
                break

            for plz in plzs:
                plz = str(plz)
                plz = re.sub('<[^<]+?>', '', plz)
                break

            for region in regions:
                region = str(region)
                region = re.sub('<[^<]+?>', '', region)
                break
        except:
            continue
        print i
        i = i + 1

        if plz == '':
            plz = getZipCode(strasse, locality, region)
        address = '"' + name + '"' + ';' + '"' + strasse + '";' + locality + ';' + str(plz) + ';' + region + '\n'

        #saveToCSV(address)

I want to extract the data from the HTML snippet below, which is repeated several times in the page. The function should produce one entry per snippet, but instead it produces a single entry containing both snippets. What do I have to change?

<div class="result-box" itemscope itemtype="http://schema.org/LocalBusiness">
        <div class="clear">
            <h2 itemprop="name"><a href="http://www.manta.com/c/mxlk5yt/belgium-jewelers-corp" class="url">Belgium Jewelers Corp</a></h2>           </div>
        <div itemprop="address" itemscope itemtype="http://schema.org/PostalAddress">               <span itemprop="addressLocality">Lawrenceville</span> <span itemprop="addressRegion">NJ</span>
        </div>          <a href="#" class="followCompany" data-emid="mxlk5yt" data-companyname="Belgium Jewelers Corp" data-location="ListingFollowButton" data-location-page="Megabrowse">
            <span class="followMsg"><span class="followIcon mrs"></span>Follow company</span>
            <span class="followingMsg"><span class="followIcon mrs"></span>Following</span>
            <span class="unfollowMsg"><span class="followIcon mrs"></span>Unfollow company</span>
        </a>            <p class="type">Jewelry Stores</p>      </div>
    </li>       <li>        <div class="icons">
        <ul>            </ul>
    </div>
    Why not use an HTML parser to extract that information instead? Regular expressions are not the right tool here. Commented May 21, 2013 at 8:29

2 Answers


Please put down that hammer; HTML is not a regular-expression shaped nail. Regular expressions to parse HTML get complicated fast, and are very fragile, easily broken when the HTML changes subtly.
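As a minimal sketch of that fragility, and a likely reason the pattern in the question returns one entry for both snippets: a greedy `*` over a character class that also matches `<`, `>` and `/` runs on to the last "Follow company</span>" in the page. The HTML string and company names below are simplified, hypothetical stand-ins for the real page.

```python
import re

# Two simplified, hypothetical result boxes on one line (newlines already stripped).
html = (
    '<div class="result-box"><a class="url">First Corp</a>'
    '<span>Follow company</span></div>'
    '<div class="result-box"><a class="url">Second Corp</a>'
    '<span>Follow company</span></div>'
)

# Greedy: '*' consumes as much as possible, so the match extends to the
# LAST "Follow company</span>" and both boxes end up in a single match.
greedy = re.findall(r'<div class="result-box">([\w "=<>/-]*)Follow company</span>', html)

# Non-greedy: '*?' stops at the FIRST "Follow company</span>" after each opening div.
lazy = re.findall(r'<div class="result-box">([\w "=<>/-]*?)Follow company</span>', html)

print(len(greedy))  # 1
print(len(lazy))    # 2
```

Switching to `*?` may patch this particular symptom, but the pattern still breaks as soon as the markup contains a character outside the class.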

Use a proper HTML parser instead. BeautifulSoup would make your task trivial:

from bs4 import BeautifulSoup

# Name the parser explicitly so results do not depend on which parsers are installed.
soup = BeautifulSoup(text, 'html.parser')
for block in soup.find_all('div', class_="result-box", itemtype="http://schema.org/LocalBusiness"):
    # Each block is one business entry.
    print(block.find('a', class_='url').string)

    street = block.find('span', itemprop="streetAddress")
    if street:
        print(street.string)

    locality = block.find('span', itemprop="addressLocality")
    if locality:
        print(locality.string)

    # .. etc. ..

1 Comment

Many thanks for all your help. I now use Beautiful Soup 4. It made the task much easier and reduced my code a lot.

You should look at Python's HTMLParser module (html.parser in Python 3). Regular expressions are notoriously bad at parsing HTML.
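A rough sketch of what that could look like with the standard-library parser, assuming Python 3. The `AddressParser` class, its attribute names, and the sample markup (including the second locality) are hypothetical illustrations, not part of the original page or of any library.

```python
from html.parser import HTMLParser  # the module is named HTMLParser in Python 2


class AddressParser(HTMLParser):
    """Collect the text of itemprop-tagged spans, one dict per result-box div."""

    def __init__(self):
        super().__init__()
        self.entries = []    # one dict per result box
        self._field = None   # itemprop of the span currently open, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "div" and "result-box" in classes:
            self.entries.append({})  # a new business entry starts here
        elif tag == "span" and attrs.get("itemprop") and self.entries:
            self._field = attrs["itemprop"]

    def handle_data(self, data):
        # Store text only while inside an itemprop span.
        if self._field and data.strip():
            self.entries[-1][self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None


# Hypothetical, simplified sample with two result boxes.
sample = (
    '<div class="result-box" itemscope>'
    '<span itemprop="addressLocality">Lawrenceville</span> '
    '<span itemprop="addressRegion">NJ</span></div>'
    '<div class="result-box" itemscope>'
    '<span itemprop="addressLocality">Trenton</span></div>'
)
parser = AddressParser()
parser.feed(sample)
print(parser.entries)
```

Each result box yields its own dictionary, so missing fields such as the postal code simply stay absent instead of carrying over from the previous entry.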
