I'm new to Python and have an HTML text file that I would like to scrape with Python 2.7.
The code below is just an example of one firm's info. In the full HTML file the structure is the same for every other firm, and the firms are positioned one below another (in case that helps).
So basically, I want to extract certain information (the firm's name, location, phone number and website) in order, so that each value is assigned to the right organization, something like this:
Liberty Associates LLC | New York | +1 973-344-8300 | www.liberty.edu
Company B | Los Angeles | +1 213-802-1770 | perchla.com
I'm sorry if I'm not being concise enough, but any suggestions on how to start the script and what it should look like would be very helpful!
The code:
<body><div class="tab_content-wrapper noPrint"><div class="tab_content_card">
<div class="card-header">
<strong title="" d.="" kon.="" nl="">"Liberty Associates LLC"</strong>
<span class="tel" title="Phone contacts">Phone contacts</span>
</div>
<div class="card-content">
<table>
<tbody>
<tr>
<td colspan="4">
<label class="downdrill-sbi" title="Industry: Immigration">Industry: Immigration</label>
</td>
</tr>
<tr>
<td width="20"> </td>
<td width="245"> </td>
<td width="50"> </td>
<td width="80"> </td>
</tr>
<tr>
<td colspan="2">
59 Wall St</td>
<td></td>
<td></td>
</tr>
<tr>
<td colspan="2">NJ 07105
<label class="downdrill-sbi" title="New York">New York</label>
</td>
<td></td>
<td></td>
</tr>
<tr>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
</tr>
<tr><td>Phone:</td><td>+1 973-344-8300</td><td>Firm Nr:</td><td>KL4568TL</td></tr>
<tr><td>Fax:</td><td>+1 973-344-8300</td><td colspan="2"></td></tr>
<tr>
<td colspan="2"> <a href="http://www.liberty.edu/" target="_blank">www.liberty.edu</a> </td>
<td>Active:</td>
<td>Yes</td>
</tr>
</tbody>
</table>
</div>
</div></div></body>
EDIT:
So with the help of ajputnam I've got this now:
from lxml import html

# Read the file and parse it; `source` avoids shadowing the built-in `str`
with open('test_html.txt', 'r') as f:
    source = f.read()
tree = html.fromstring(source)

# Absolute XPaths to the name, city, phone number and website text
name = tree.xpath("/html/body/div/div/div[1]/strong/text()")
place = tree.xpath("/html/body/div/div/div[2]/table/tbody/tr[4]/td[1]/label/text()")
phone = tree.xpath("/html/body/div/div/div[2]/table/tbody/tr[6]/td[2]/text()")
url = tree.xpath("/html/body/div/div/div[2]/table/tbody/tr[8]/td[1]/a/text()")
print(name, place, phone, url)
Prints:
(['"Liberty Associates LLC"'], ['New York'], ['+1 973-344-8300'], ['www.liberty.edu'])
However, when I try this code on the whole HTML file (with data for more than one firm), each list contains the matches for all firms one after another. How can I properly use [0] to get the data structured like this?
Liberty Associates LLC | New York | +1 973-344-8300 | www.liberty.edu
Company B | Los Angeles | +1 213-802-1770 | perchla.com
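One possible way to keep each firm's fields together is to loop over the firm cards first and run relative XPaths (starting with `./`) inside each card, taking `[0]` per card rather than on the whole document. The sketch below is only an illustration under assumptions: that every firm sits in its own `<div class="tab_content_card">`, and that every card has the same row layout as the example above. The `CARD` template, the `first_text` and `extract_firms` helpers, and the trimmed sample markup are hypothetical and stand in for the real file.

```python
from lxml import html

# Hypothetical, trimmed stand-in for one firm card; it keeps the same
# row positions (tr[4] city, tr[6] phone, tr[8] website) as the example.
CARD = """<div class="tab_content_card">
<div class="card-header"><strong>"{name}"</strong></div>
<div class="card-content"><table><tbody>
<tr><td colspan="4"><label>Industry</label></td></tr>
<tr><td> </td></tr>
<tr><td colspan="2">street</td></tr>
<tr><td colspan="2">zip <label>{place}</label></td></tr>
<tr><td> </td></tr>
<tr><td>Phone:</td><td>{phone}</td></tr>
<tr><td>Fax:</td><td></td></tr>
<tr><td colspan="2"> <a href="http://{url}/">{url}</a> </td></tr>
</tbody></table></div>
</div>"""

SAMPLE = (
    '<body><div class="tab_content-wrapper noPrint">'
    + CARD.format(name='Liberty Associates LLC', place='New York',
                  phone='+1 973-344-8300', url='www.liberty.edu')
    + CARD.format(name='Company B', place='Los Angeles',
                  phone='+1 213-802-1770', url='perchla.com')
    + '</div></body>'
)


def first_text(node, path):
    """Return the first XPath match stripped of whitespace, or '' if none."""
    matches = node.xpath(path)
    return matches[0].strip() if matches else ''


def extract_firms(source):
    """Return one (name, place, phone, url) tuple per firm card."""
    tree = html.fromstring(source)
    firms = []
    # Assumption: one tab_content_card div per firm.
    for card in tree.xpath('//div[@class="tab_content_card"]'):
        # Paths start with "./" so they only search inside this card.
        name = first_text(card, './div[1]/strong/text()').strip('"')
        place = first_text(card, './div[2]/table/tbody/tr[4]/td[1]/label/text()')
        phone = first_text(card, './div[2]/table/tbody/tr[6]/td[2]/text()')
        url = first_text(card, './div[2]/table/tbody/tr[8]/td[1]/a/text()')
        firms.append((name, place, phone, url))
    return firms


lines = [' | '.join(firm) for firm in extract_firms(SAMPLE)]
for line in lines:
    print(line)
```

For the real file, `SAMPLE` would be replaced by the contents of `test_html.txt`. If some cards have a different number of rows, the positional `tr[4]`, `tr[6]` and `tr[8]` steps would need adjusting.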
