How to extract specific data from an HTML page with Python?

Question

I'm new to Python and I have an HTML text file which I would like to scrape with Python 2.7.

The code below is just an example of one firm's info. In the full HTML text file the structure is the same for every other firm, and the blocks are positioned one underneath another (if that helps).

So basically, I want to extract certain information (like the firm's name, location, phone number and website) in the order it appears, so that each value is assigned to the right organization, something like this:

Liberty Associates LLC | New York    | +1 973-344-8300 | www.liberty.edu
Company B              | Los Angeles | +1 213-802-1770 | perchla.com

I'm sorry if I'm not being precise enough, but any suggestions on how to start the script and what it should look like would be very helpful!

The code:

<body><div class="tab_content-wrapper noPrint"><div class="tab_content_card">
            <div class="card-header">
                <strong title="" d.="" kon.="" nl="">"Liberty Associates LLC"</strong>
                <span class="tel" title="Phone contacts">Phone contacts</span>
			
            </div>
            <div class="card-content">
                
				
                <table>
                    <tbody>
                        <tr>
                            <td colspan="4">
                                
                                <label class="downdrill-sbi" title="Industry: Immigration">Industry: Immigration</label>
                            </td>
                        </tr>
                        <tr>
                            <td width="20">&nbsp;</td>
                            <td width="245">&nbsp;</td>
                            <td width="50">&nbsp;</td>
                            <td width="80">&nbsp;</td>
                        </tr>
                        <tr>
                            <td colspan="2">
59 Wall St</td>
                            <td></td>
                            <td></td>
                        </tr>
                        <tr>
                            <td colspan="2">NJ 07105&nbsp;&nbsp;
                                
                                <label class="downdrill-sbi" title="New York">New York</label>
                            </td>
                            <td></td>
                            <td></td>
                        </tr>
                        <tr>
                            <td>&nbsp;</td>
                            <td>&nbsp;</td>
                            <td>&nbsp;</td>
                            <td>&nbsp;</td>
                        </tr>
                        <tr><td>Phone:</td><td>+1 973-344-8300</td><td>Firm Nr:</td><td>KL4568TL</td></tr>
                        <tr><td>Fax:</td><td>+1 973-344-8300</td><td colspan="2"></td></tr>
                        <tr>
                            <td colspan="2"> <a href="http://www.liberty.edu/" target="_blank">www.liberty.edu</a> </td>
                            <td>Active:</td>
                            <td>Yes</td>
                        </tr>
                    </tbody>
                </table>
            </div>
            

        </div></div></body>

How it looks on a webpage: (screenshot not preserved)

EDIT:

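So with the help of ajputnam I've got this now:

from lxml import html

# Read the saved HTML into a string (not named `str`, which would shadow the built-in)
with open('test_html.txt', 'r') as f:
    html_text = f.read()
tree = html.fromstring(html_text)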

name = tree.xpath("/html/body/div/div/div[1]/strong/text()")
place = tree.xpath("/html/body/div/div/div[2]/table/tbody/tr[4]/td[1]/label/text()")
phone = tree.xpath("/html/body/div/div/div[2]/table/tbody/tr[6]/td[2]/text()")
url = tree.xpath("/html/body/div/div/div[2]/table/tbody/tr[8]/td[1]/a/text()")

print(name, place, phone, url)

Prints:

(['"Liberty Associates LLC"'], ['New York'], ['+1 973-344-8300'], ['www.liberty.edu'])

However, when I try this code on the whole HTML file (with more than one firm's data), all matching values for each variable are printed right behind each other. How can I properly use [0] to get the data structured like this?

Liberty Associates LLC | New York    | +1 973-344-8300 | www.liberty.edu
Company B              | Los Angeles | +1 213-802-1770 | perchla.com

how does it look on a webpage?
– Blundering Philosopher, Commented Mar 8, 2017 at 23:18

@Radical Fanatic please see my updated post
– jakeT888, Commented Mar 8, 2017 at 23:27

Arthur Putnam · Accepted Answer · 2017-03-08 23:35:35Z

7

First you will need to get the HTML from the page. You can use a library like requests to do this.

from lxml import html
import requests

page = requests.get('url')
tree = html.fromstring(page.content)

Then you can access things in the "tree" using selectors.

prices = tree.xpath('//span[@class="item-price"]/text()')

Or you could just parse the string normally.

See: HTML Scraping

Reading from file

from lxml import html

# Read the HTML from the file as a string
with open('file.html', 'r') as f:
    html_text = f.read()
tree = html.fromstring(html_text)

# Collect the text of every <strong> inside a card header
company = tree.xpath('//div[@class="card-header"]/strong/text()')
print(company)
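
When the file holds several firms, one way to keep each firm's values together is to select every card first and then run relative XPath expressions (starting with `.`) inside each card. The following is a minimal sketch and not part of the original answer: `SAMPLE`, `first` and `extract_rows` are hypothetical names, the second card is made-up markup built from the "Company B" row in the question, and the expressions assume every card follows the structure of the snippet in the question.

```python
from lxml import html

# Hypothetical two-card sample in the same shape as the question's markup.
SAMPLE = """
<body><div class="tab_content-wrapper noPrint">
<div class="tab_content_card">
  <div class="card-header"><strong>"Liberty Associates LLC"</strong></div>
  <div class="card-content"><table><tbody>
    <tr><td colspan="4"><label class="downdrill-sbi" title="Industry: Immigration">Industry: Immigration</label></td></tr>
    <tr><td colspan="2">NJ 07105 <label class="downdrill-sbi" title="New York">New York</label></td></tr>
    <tr><td>Phone:</td><td>+1 973-344-8300</td><td>Firm Nr:</td><td>KL4568TL</td></tr>
    <tr><td colspan="2"><a href="http://www.liberty.edu/">www.liberty.edu</a></td></tr>
  </tbody></table></div>
</div>
<div class="tab_content_card">
  <div class="card-header"><strong>"Company B"</strong></div>
  <div class="card-content"><table><tbody>
    <tr><td colspan="2"><label class="downdrill-sbi" title="Los Angeles">Los Angeles</label></td></tr>
    <tr><td>Phone:</td><td>+1 213-802-1770</td></tr>
    <tr><td colspan="2"><a href="http://perchla.com/">perchla.com</a></td></tr>
  </tbody></table></div>
</div>
</div></body>
"""


def first(nodes, default=""):
    """Return the first XPath text result, stripped, or a default if none matched."""
    return nodes[0].strip() if nodes else default


def extract_rows(html_text):
    """Return one (name, place, phone, url) tuple per firm card."""
    tree = html.fromstring(html_text)
    rows = []
    # One iteration per firm, so a missing field cannot shift later firms.
    for card in tree.xpath('//div[@class="tab_content_card"]'):
        name = first(card.xpath('.//div[@class="card-header"]/strong/text()')).strip('"')
        # The industry label shares the same class, so skip it by its title prefix.
        place = first(card.xpath(
            './/label[@class="downdrill-sbi"][not(starts-with(@title, "Industry:"))]/text()'))
        # Find the cell labelled "Phone:" and take the cell right after it.
        phone = first(card.xpath(
            './/td[normalize-space()="Phone:"]/following-sibling::td[1]/text()'))
        url = first(card.xpath('.//a/text()'))
        rows.append((name, place, phone, url))
    return rows


for row in extract_rows(SAMPLE):
    print(" | ".join(row))
```

Anchoring on the "Phone:" label instead of a row number such as tr[6] is a design choice: it should keep working if a card has an extra or missing row, although it still depends on the label text staying the same.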

edited Mar 8, 2017 at 23:35

answered Mar 8, 2017 at 23:20


7 Comments

jakeT888 Over a year ago

thanks for your reply. Unfortunately I don't have the page/URL. I only have the HTML code saved in a txt file on my HDD.

Arthur Putnam Over a year ago

Oh even easier. Just read the file in as a string and you can use the same steps.

jakeT888 Over a year ago

Does this method also loop, so that it crawls the other firms' data when it's done with the first HTML "block"?

Arthur Putnam Over a year ago

Yes and no. The selector will grab all the HTML blocks that match that pattern, so if there is more than one it will grab them all.

Arthur Putnam Over a year ago

Yes, selectors grab them in order. So array1[0] will go with array2[0]. Make sense?
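
The index-based pairing described in the comment above can be sketched with zip. This is only an illustration under assumptions: `format_rows` is a hypothetical helper, and the four lists are hard-coded from the desired output in the question to stand in for the results of the four tree.xpath(...) calls. Pairing by index is only safe when every firm has every field, so the sketch refuses to continue if the lists differ in length.

```python
# Stand-ins for the four lists returned by the tree.xpath(...) calls.
names = ['"Liberty Associates LLC"', '"Company B"']
places = ["New York", "Los Angeles"]
phones = ["+1 973-344-8300", "+1 213-802-1770"]
urls = ["www.liberty.edu", "perchla.com"]


def format_rows(names, places, phones, urls):
    """Pair same-index items and pad each column to its widest value."""
    if not (len(names) == len(places) == len(phones) == len(urls)):
        # A shorter list means some firm lacks a field and the indexes no longer line up.
        raise ValueError("lists differ in length; pairing by index would be wrong")
    rows = [(n.strip('"'), p, ph, u)
            for n, p, ph, u in zip(names, places, phones, urls)]
    if not rows:
        return []
    # Width of each column is the length of its longest cell.
    widths = [max(len(row[i]) for row in rows) for i in range(4)]
    return [" | ".join(cell.ljust(w) for cell, w in zip(row, widths)).rstrip()
            for row in rows]


for line in format_rows(names, places, phones, urls):
    print(line)
```

If a firm can lack a phone number or website, the lists will not line up, and extracting the fields card by card is the safer route.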
