How to write this in regular expression in Python?

Question

I have a large HTML file from which I need to extract some data using regular expressions. The first thing I need is the name of each restaurant (hotel). The names appear in this format:

Update:

<html><head>
<meta http-equiv="content-type" content="text/html; charset=UTF-8"></head><body><div class="businessresult clearfix">
        <div class="leftcol">
            <div id="bizTitle0" class="itemheading">
                <a href="https://courses.ischool.berkeley.edu/biz/capannina-san-francisco" id="bizTitleLink0">1.    Capannina
</a>
            </div>
                <div class="itemcategories">
                    Categories: <a href="https://courses.ischool.berkeley.edu/search?mapsize=small&amp;main_places=CA%3ASan_Francisco%3A%3ASOMA%2CCA%3ASan_Francisco%3A%3APacific_Heights%2CCA%3ASan_Francisco%3A%3AMission%2CCA%3ASan_Francisco%3A%3AHaight-Ashbury&amp;places=CA%3ASan_Francisco%3A%3A%5BSOMA%2CMission%2CMarina%2FCow_Hollow%5D&amp;rpp=40&amp;bbox=-122.471809387%2C37.7384127869%2C-122.368125916%2C37.8203616433&amp;attrs=&amp;sortby=category&amp;show_more_search_options=true&amp;cflt=italian&amp;find_loc=san+francisco%2C+ca" rel="italian" class="category" id="cat_result_0_italian">Italian</a>, <a href="https://courses.ischool.berkeley.edu/search?mapsize=small&amp;main_places=CA%3ASan_Francisco%3A%3ASOMA%2CCA%3ASan_Francisco%3A%3APacific_Heights%2CCA%3ASan_Francisco%3A%3AMission%2CCA%3ASan_Francisco%3A%3AHaight-Ashbury&amp;places=CA%3ASan_Francisco%3A%3A%5BSOMA%2CMission%2CMarina%2FCow_Hollow%5D&amp;rpp=40&amp;bbox=-122.471809387%2C37.7384127869%2C-122.368125916%2C37.8203616433&amp;attrs=&amp;sortby=category&amp;show_more_search_options=true&amp;cflt=seafood&amp;find_loc=san+francisco%2C+ca" rel="seafood" class="category" id="cat_result_0_seafood">Seafood</a>
                </div>
                <div class="itemneighborhoods">
                    Neighborhood: <a href="https://courses.ischool.berkeley.edu/search?find_desc=&amp;mapsize=small&amp;main_places=CA%3ASan_Francisco%3A%3ASOMA%2CCA%3ASan_Francisco%3A%3APacific_Heights%2CCA%3ASan_Francisco%3A%3AMission%2CCA%3ASan_Francisco%3A%3AHaight-Ashbury&amp;places=CA%3ASan_Francisco%3A%3A%5BSOMA%2CMission%2CMarina%2FCow_Hollow%5D&amp;attrs=&amp;sortby=category&amp;cflt=italian&amp;show_more_search_options=true&amp;parent_request_id=9536eaa25db61373&amp;find_loc=Marina%2FCow+Hollow%2C+San+Francisco%2C+CA" title="Marina/Cow Hollow, San Francisco, CA" class="location" id="hood_result_0_0">Marina/Cow Hollow</a>
                </div>
        </div>
        <div class="rightcol">
                <div class="rating"><img src="yelp_listings_files/stars_map.html" alt="4 star rating" title="4 star rating" class="stars_4 " height="325" width="83"></div> <a class="reviews" href="https://courses.ischool.berkeley.edu/biz/capannina-san-francisco">270 reviews</a>


            <address>
                1809 Union St<br>San Francisco, CA 94123<br>
                    </address><div class="phone">
                        (415) 409-8001
                    </div>


        </div>

There are 40 hotels altogether. I think there are two spaces after the . that follows the number. I need to list all the hotels from 1 to 40. I have tried using:

re.findall("[./0-9]", string_Name)

It outputs only the numbers. I want to get the number and the name of every hotel. How can I do that?

The answer by Blender gives the rating and the restaurant list. That's fine, but I want the rating and the restaurant name in separate variables.

Blender · Accepted Answer · 2013-04-24 05:20:46Z

5

Parse the HTML:

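import re
from bs4 import BeautifulSoup

html = '''
<a href="https://courses.ischool.berkeley.edu/biz/capannina-san-francisco" id="bizTitleLink0">1.    Capannina
</a>
<a href="https://courses.ischool.berkeley.edu/biz/ristorante-parma-san-francisco" id="bizTitleLink4">5.     Ristorante Parma
</a>
'''

# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(html, 'html.parser')

# Keep only links whose text starts with a digit, e.g. "1.    Capannina"
# (bs4 versions before 4.4.0 call this argument text= instead of string=)
for link in soup.find_all('a', string=re.compile(r'^\d')):
    print(link.get_text())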

And the output:

1.    Capannina

5.     Ristorante Parma



7 Comments

pynovice Over a year ago

I get "No module named bs4" on Python 3. I tried sudo apt-get install python-bs4 and sudo pip install beautifulsoup4.

Blender Over a year ago

@user2032220: pip should probably be pip3.

pynovice Over a year ago

Actually, your answer is helpful. BeautifulSoup is amazing and I installed it. But there's a bit of a problem with the output. I will update the question; please see the update.

pynovice Over a year ago

Your answer prints the hotel names and their star ratings. I want the stars, hotel name, telephone number and neighbourhood in separate variables.

Blender Over a year ago

@user2032220: You can further refine the search with keyword arguments to find_all. Read through the documentation for a bunch of examples: crummy.com/software/BeautifulSoup/bs4/doc
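A rough sketch of that refinement, assuming each listing sits in a div with class businessresult as in the question's sample. The HTML below is a trimmed copy of that sample with shortened href values, and the results list and variable names are only illustrative:

```python
from bs4 import BeautifulSoup

# Trimmed copy of the sample listing from the question (hrefs shortened)
html = '''
<div class="businessresult clearfix">
  <div class="leftcol">
    <div id="bizTitle0" class="itemheading">
      <a href="/biz/capannina-san-francisco" id="bizTitleLink0">1.    Capannina
</a>
    </div>
    <div class="itemneighborhoods">
      Neighborhood: <a href="/search" class="location" id="hood_result_0_0">Marina/Cow Hollow</a>
    </div>
  </div>
  <div class="rightcol">
    <div class="rating"><img src="stars_map.html" alt="4 star rating" class="stars_4 "></div>
    <address>1809 Union St<br>San Francisco, CA 94123<br></address>
    <div class="phone">
      (415) 409-8001
    </div>
  </div>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')

results = []
# class_ matches a tag if any one of its CSS classes equals the given value
for biz in soup.find_all('div', class_='businessresult'):
    # Heading text looks like "1.    Capannina": split once at the first dot
    title = biz.find('div', class_='itemheading').get_text(strip=True)
    number, name = title.split('.', 1)
    rating = biz.find('div', class_='rating').img['alt']
    phone = biz.find('div', class_='phone').get_text(strip=True)
    neighbourhood = biz.find('a', class_='location').get_text(strip=True)
    results.append((int(number), name.strip(), rating, phone, neighbourhood))

for number, name, rating, phone, neighbourhood in results:
    print(number, name, rating, phone, neighbourhood, sep=' | ')
```

Each field ends up in its own variable inside the loop. If a listing is missing one of these elements, find returns None and the attribute access raises an error, so a real page may need a check for that.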


Patashu · Accepted Answer · 2013-04-24 05:23:18Z

0

You shouldn't run regexes on HTML directly (it is better to use an HTML parser first), but try this regex:

(\d+)\.\s+([^<]+)

(\d+) : one or more digits

\. : a literal dot

\s+ : one or more whitespace characters

([^<]+) : one or more characters other than <

Each pair of parentheses () creates a capture group. Capture group 1 will contain the number. Capture group 2 will contain the name, including any trailing whitespace before the next <, which you can strip afterwards.
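A minimal sketch of using this regex from Python with re.findall, run against the two sample links from the other answer. The numbers and names lists are illustrative names only, and on a full page the pattern could also match unrelated text of the form "digits, dot, whitespace":

```python
import re

html = '''
<a href="https://courses.ischool.berkeley.edu/biz/capannina-san-francisco" id="bizTitleLink0">1.    Capannina
</a>
<a href="https://courses.ischool.berkeley.edu/biz/ristorante-parma-san-francisco" id="bizTitleLink4">5.     Ristorante Parma
</a>
'''

pattern = re.compile(r'(\d+)\.\s+([^<]+)')

# With two capture groups, findall returns a list of (number, name) tuples.
# The name group runs up to the next "<", so strip the trailing newline.
matches = [(int(num), name.strip()) for num, name in pattern.findall(html)]

numbers = [num for num, _ in matches]
names = [name for _, name in matches]

print(numbers)  # [1, 5]
print(names)    # ['Capannina', 'Ristorante Parma']
```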



2 Comments

pynovice Over a year ago

How do I specify a string so that everything after it is listed? For example, list everything after hello?

Patashu Over a year ago

@user2032220 Do you mean a regex like: hello(.*) and get the contents of capture 1?
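A small sketch of that idea, with a made-up input string. By default . does not match a newline, so hello(.*) captures only the rest of the current line; passing re.DOTALL lets it capture everything to the end of the text:

```python
import re

text = "say hello world\nsecond line"

# Without flags, (.*) stops at the end of the line
m = re.search(r'hello(.*)', text)
rest_of_line = m.group(1)
print(repr(rest_of_line))   # ' world'

# With re.DOTALL, "." also matches newlines
m_all = re.search(r'hello(.*)', text, re.DOTALL)
rest_of_text = m_all.group(1)
print(repr(rest_of_text))   # ' world\nsecond line'
```

re.search returns None when hello is not found, so check the result before calling group on real input.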
