I have a large HTML file from which I need to extract some data using regular expressions. The first item is the name of each restaurant. The names appear in this format:

Update:

<html><head>
<meta http-equiv="content-type" content="text/html; charset=UTF-8"></head><body><div class="businessresult clearfix">
        <div class="leftcol">
            <div id="bizTitle0" class="itemheading">
                <a href="https://courses.ischool.berkeley.edu/biz/capannina-san-francisco" id="bizTitleLink0">1.    Capannina
</a>
            </div>
                <div class="itemcategories">
                    Categories: <a href="https://courses.ischool.berkeley.edu/search?mapsize=small&amp;main_places=CA%3ASan_Francisco%3A%3ASOMA%2CCA%3ASan_Francisco%3A%3APacific_Heights%2CCA%3ASan_Francisco%3A%3AMission%2CCA%3ASan_Francisco%3A%3AHaight-Ashbury&amp;places=CA%3ASan_Francisco%3A%3A%5BSOMA%2CMission%2CMarina%2FCow_Hollow%5D&amp;rpp=40&amp;bbox=-122.471809387%2C37.7384127869%2C-122.368125916%2C37.8203616433&amp;attrs=&amp;sortby=category&amp;show_more_search_options=true&amp;cflt=italian&amp;find_loc=san+francisco%2C+ca" rel="italian" class="category" id="cat_result_0_italian">Italian</a>, <a href="https://courses.ischool.berkeley.edu/search?mapsize=small&amp;main_places=CA%3ASan_Francisco%3A%3ASOMA%2CCA%3ASan_Francisco%3A%3APacific_Heights%2CCA%3ASan_Francisco%3A%3AMission%2CCA%3ASan_Francisco%3A%3AHaight-Ashbury&amp;places=CA%3ASan_Francisco%3A%3A%5BSOMA%2CMission%2CMarina%2FCow_Hollow%5D&amp;rpp=40&amp;bbox=-122.471809387%2C37.7384127869%2C-122.368125916%2C37.8203616433&amp;attrs=&amp;sortby=category&amp;show_more_search_options=true&amp;cflt=seafood&amp;find_loc=san+francisco%2C+ca" rel="seafood" class="category" id="cat_result_0_seafood">Seafood</a>
                </div>
                <div class="itemneighborhoods">
                    Neighborhood: <a href="https://courses.ischool.berkeley.edu/search?find_desc=&amp;mapsize=small&amp;main_places=CA%3ASan_Francisco%3A%3ASOMA%2CCA%3ASan_Francisco%3A%3APacific_Heights%2CCA%3ASan_Francisco%3A%3AMission%2CCA%3ASan_Francisco%3A%3AHaight-Ashbury&amp;places=CA%3ASan_Francisco%3A%3A%5BSOMA%2CMission%2CMarina%2FCow_Hollow%5D&amp;attrs=&amp;sortby=category&amp;cflt=italian&amp;show_more_search_options=true&amp;parent_request_id=9536eaa25db61373&amp;find_loc=Marina%2FCow+Hollow%2C+San+Francisco%2C+CA" title="Marina/Cow Hollow, San Francisco, CA" class="location" id="hood_result_0_0">Marina/Cow Hollow</a>
                </div>
        </div>
        <div class="rightcol">
                <div class="rating"><img src="yelp_listings_files/stars_map.html" alt="4 star rating" title="4 star rating" class="stars_4 " height="325" width="83"></div> <a class="reviews" href="https://courses.ischool.berkeley.edu/biz/capannina-san-francisco">270 reviews</a>


            <address>
                1809 Union St<br>San Francisco, CA 94123<br>
                    </address><div class="phone">
                        (415) 409-8001
                    </div>


        </div>

There are 40 hotels altogether. I think there are two spaces after the period that follows the number. I need to list all the hotels from 1 to 40. I have tried:

re.findall("[./0-9]", string_Name)

It outputs only the numbers. I want to get both the number and the name of every hotel. How can I do that?

The answer by Blender gives the rating and the restaurant list. That's fine, but I want the rating and the restaurant name in separate variables.

2 Answers

Parse the HTML:

import re
from bs4 import BeautifulSoup

html = '''
<a href="https://courses.ischool.berkeley.edu/biz/capannina-san-francisco" id="bizTitleLink0">1.    Capannina
</a>
<a href="https://courses.ischool.berkeley.edu/biz/ristorante-parma-san-francisco" id="bizTitleLink4">5.     Ristorante Parma
</a>
'''

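# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(html, 'html.parser')

# Keep only the links whose text starts with a digit (the numbered listings).
# Releases of bs4 older than 4.4.0 call this argument text instead of string.
for link in soup.find_all('a', string=re.compile(r'^\d')):
    print(link.get_text())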

And the output:

1.    Capannina

5.     Ristorante Parma
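To keep the number and the name in separate variables, the text returned by get_text() can be split at the first period. This is a minimal sketch, assuming every link text has the form "N. Name"; the helper name split_listing is made up for illustration:

```python
def split_listing(text):
    # Split at the first period: the left part is the number, the right part the name
    number, _, name = text.partition('.')
    return int(number), name.strip()

# Strings as returned by link.get_text() in the loop above
texts = ['1.    Capannina\n', '5.     Ristorante Parma\n']

listings = [split_listing(t) for t in texts]
for number, name in listings:
    print(number, name)
```

Because only the first period is used as the separator, a name that itself contains a period stays intact.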

7 Comments

I get "No module named bs4". Is that because I'm on Python 3? I tried sudo apt-get install python-bs4 and sudo pip install beautifulsoup4.
@user2032220: pip should probably be pip3.
Actually, your answer is helpful. BeautifulSoup is amazing, and I installed it. But there's a bit of a problem with the output. I will update the question; please see the update.
Your answer prints out the hotel names and their stars. I want the stars, hotel name, telephone number and neighbourhood in separate variables.
@user2032220: You can further refine the search with keyword arguments to find_all. Read through the documentation for a bunch of examples: crummy.com/software/BeautifulSoup/bs4/doc

You shouldn't run regexes on HTML directly (it is better to use an HTML parser first), but you can try this regex:

(\d+)\.\s+([^<]+)

(\d+) one or more digits

\. a literal dot

\s+ one or more whitespace characters

([^<]+) one or more characters other than <

Each pair of parentheses () creates a capture group. Capture group 1 contains the number. Capture group 2 contains the name, including any trailing whitespace before the next <.
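As a rough sketch of how this regex could be used from Python, re.findall returns one tuple per match with the two capture groups in order. The sample HTML is the snippet from the other answer; the variable names are only illustrative:

```python
import re

html = '''
<a href="https://courses.ischool.berkeley.edu/biz/capannina-san-francisco" id="bizTitleLink0">1.    Capannina
</a>
<a href="https://courses.ischool.berkeley.edu/biz/ristorante-parma-san-francisco" id="bizTitleLink4">5.     Ristorante Parma
</a>
'''

pattern = re.compile(r'(\d+)\.\s+([^<]+)')

# Group 2 runs up to the next "<", so strip the trailing newline from each name
results = [(int(number), name.strip()) for number, name in pattern.findall(html)]

for number, name in results:
    print(number, name)
```

On the full page the pattern may also match other "digits, dot, whitespace" sequences, which is one reason to narrow the input with an HTML parser first.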

2 Comments

How do I write a regex that lists everything after a given string? For example, everything after hello?
@user2032220 Do you mean a regex like: hello(.*) and get the contents of capture 1?
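A short sketch of that suggestion in Python, assuming the text of interest is on the same line as hello (by default . does not match a newline). The sample string is made up:

```python
import re

text = 'say hello world and more'

# Capture group 1 holds everything after the first "hello" on that line
match = re.search(r'hello(.*)', text)
rest = match.group(1) if match else None
print(repr(rest))
```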
