HTML file parsing in Python

Question

I have a very long HTML file that looks exactly like this - html file. I want to parse the file so that I get the information in the form of a tuple.

Example:

<tr>
      <td>Cech</td>
      <td>Chelsea</td>
      <td>30</td>
      <td>£6.4</td>
</tr>

The above information should look like ("Cech", "Chelsea", 30, 6.4). However, if you look closely at the link I posted, the HTML example comes under an <h2>Goalkeepers</h2> tag. I need this tag too, so the resulting tuple should look like ("Cech", "Chelsea", 30, 6.4, "Goalkeepers"). Further down the file, more players come under <h2> tags of Midfielders, Defenders and Forwards.

I tried using the BeautifulSoup and nltk libraries and got lost. So now I have the following code:

import nltk
from urllib import urlopen

url = "http://fantasy.premierleague.com/player-list/"
html = urlopen(url).read()
raw = nltk.clean_html(html)
print raw

which just strips all the tags from the HTML file and gives something like this:

          Cech
          Chelsea
          30
          £6.4

Although I could write a crude piece of code that reads every line and assigns it to a tuple, I cannot come up with any solution that also incorporates the player position (the string present in the <h2> tags). Any solutions or suggestions will be greatly appreciated.

The reason I am inclined towards using tuples is so that I can use unpacking; I plan on populating a MySQL table with the unpacked values.
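
As a rough illustration of that plan: tuples can be passed directly as the parameters of a parameterized INSERT. The sketch below uses the standard-library `sqlite3` module with an in-memory database as a stand-in for MySQL, so that it runs without a server; the table name `players` and its columns are made up for the example. MySQL drivers follow the same DB-API pattern but typically use `%s` rather than `?` as the placeholder.

```python
import sqlite3

rows = [
    ("Cech", "Chelsea", 30, 6.4, "Goalkeepers"),
    ("Hart", "Man City", 28, 6.4, "Goalkeepers"),
]

conn = sqlite3.connect(":memory:")  # stand-in for a real MySQL connection
conn.execute(
    "CREATE TABLE players "
    "(name TEXT, team TEXT, points INTEGER, price REAL, position TEXT)"
)
# Each tuple fills the five placeholders in order.
conn.executemany("INSERT INTO players VALUES (?, ?, ?, ?, ?)", rows)
conn.commit()

stored = conn.execute(
    "SELECT name, team, points, price, position FROM players"
).fetchall()
for name, team, points, price, position in stored:  # tuple unpacking
    print(name, team, points, price, position)
```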

I suppose you now see, in light of the answer, that nltk was the wrong tool for the job.
– msw, Commented Oct 19, 2013 at 0:19
I tried playing with nltk because I was having a hard time using it. It looked pretty easy but gave me a recursion error. It took a while to understand what the problem was.
– begin.py, Commented Oct 19, 2013 at 1:04

Foo Bar User · Accepted Answer · 2013-10-19 00:12:01Z

2

from pprint import pprint
from urllib.request import urlopen  # Python 3; in Python 2: from urllib import urlopen

from bs4 import BeautifulSoup

# Fetch the page as in the question
url = "http://fantasy.premierleague.com/player-list/"
html = urlopen(url).read()

soup = BeautifulSoup(html, "html.parser")
h2s = soup.select("h2")        # all position headings
tables = soup.select("table")  # all player tables

players = []
for i, table in enumerate(tables):
    # Every h2 heading has 2 tables (8 tables, 4 headings),
    # so table i belongs to heading i // 2.
    title = h2s[i // 2].text
    for tr in table.select("tr"):
        player = (title,)  # start each tuple with the position
        for td in tr.select("td"):
            player += (td.text,)  # append the text of each cell
        if len(player) > 1:
            # Keep the row only if it contains a player,
            # not just the position, e.g. ("Goalkeepers",)
            players.append(player)
pprint(players)

output:

[('Goalkeepers', 'Cech', 'Chelsea', '30', '£6.4'),
 ('Goalkeepers', 'Hart', 'Man City', '28', '£6.4'),
 ('Goalkeepers', 'Krul', 'Newcastle', '21', '£5.0'),
 ('Goalkeepers', 'Ruddy', 'Norwich', '25', '£5.0'),
 ('Goalkeepers', 'Vorm', 'Swansea', '19', '£5.0'),
 ('Goalkeepers', 'Stekelenburg', 'Fulham', '6', '£4.9'),
 ('Goalkeepers', 'Pantilimon', 'Man City', '0', '£4.9'),
 ('Goalkeepers', 'Lindegaard', 'Man Utd', '0', '£4.9'),
 ('Goalkeepers', 'Butland', 'Stoke City', '0', '£4.9'),
 ('Goalkeepers', 'Foster', 'West Brom', '13', '£4.9'),
 ('Goalkeepers', 'Viviano', 'Arsenal', '0', '£4.8'),
 ('Goalkeepers', 'Schwarzer', 'Chelsea', '0', '£4.7'),
 ('Goalkeepers', 'Boruc', 'Southampton', '42', '£4.7'),
 ('Goalkeepers', 'Myhill', 'West Brom', '15', '£4.5'),
 ('Goalkeepers', 'Fabianski', 'Arsenal', '0', '£4.4'),
 ('Goalkeepers', 'Gomes', 'Tottenham', '0', '£4.4'),
 ('Goalkeepers', 'Friedel', 'Tottenham', '0', '£4.4'),
 ('Goalkeepers', 'Henderson', 'West Ham', '0', '£4.0'),
 ('Defenders', 'Baines', 'Everton', '43', '£7.7'),
 ('Defenders', 'Vertonghen', 'Tottenham', '34', '£7.0'),
 ('Defenders', 'Taylor', 'Cardiff City', '14', '£4.5'),
 ('Defenders', 'Zverotic', 'Fulham', '0', '£4.5'),
 ('Defenders', 'Davies', 'Hull City', '28', '£4.5'),
 ('Defenders', 'Flanagan', 'Liverpool', '0', '£4.5'),
 ('Defenders', 'Dawson', 'West Brom', '0', '£3.9'),
 ('Defenders', 'Potts', 'West Ham', '0', '£3.9'),
 ('Defenders', 'Spence', 'West Ham', '0', '£3.9'),
 ('Midfielders', 'Özil', 'Arsenal', '24', '£10.6'),
 ('Midfielders', 'Redmond', 'Norwich', '20', '£5.0'),
 ('Midfielders', 'Mavrias', 'Sunderland', '5', '£5.0'),
 ('Midfielders', 'Gera', 'West Brom', '0', '£5.0'),
 ('Midfielders', 'Essien', 'Chelsea', '0', '£4.9'),
 ('Midfielders', 'Brown', 'West Brom', '0', '£4.3'),
 ('Forwards', 'van Persie', 'Man Utd', '24', '£13.9'),
 ('Forwards', 'Cornelius', 'Cardiff City', '1', '£5.4'),
 ('Forwards', 'Elmander', 'Norwich', '7', '£5.4'),
 ('Forwards', 'Murray', 'Crystal Palace', '0', '£5.3'),
 ('Forwards', 'Vydra', 'West Brom', '2', '£5.3'),
 ('Forwards', 'Proschwitz', 'Hull City', '0', '£4.3')]
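
The tuples above keep every field as a string and put the position first, whereas the question asks for ("Cech", "Chelsea", 30, 6.4, "Goalkeepers"). A minimal sketch of that conversion, assuming each tuple has exactly the five fields shown in the output and that the price always starts with a "£" sign; `to_record` is a hypothetical helper name, not part of the answer:

```python
def to_record(player):
    """Turn (position, name, team, points, price) strings into
    (name, team, points, price, position) with numeric points and price."""
    position, name, team, points, price = player  # tuple unpacking
    # Assumption: price looks like '£6.4'; strip the currency sign first.
    return (name, team, int(points), float(price.lstrip("£")), position)


sample = [
    ("Goalkeepers", "Cech", "Chelsea", "30", "£6.4"),
    ("Forwards", "van Persie", "Man Utd", "24", "£13.9"),
]
records = [to_record(p) for p in sample]
print(records)
```

A row with a missing or extra cell would raise a ValueError at the unpacking step, which is probably preferable to silently storing a misaligned record.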

edited Oct 19, 2013 at 0:12

answered Oct 18, 2013 at 23:29


8 Comments

begin.py Over a year ago

I am not sure what the above code is for. The code I posted in the question using the nltk module does exactly the same as what your code does. In fact, your code even deletes the Defenders, Midfielders and Forwards tags completely, which are actually needed in my output.

Foo Bar User Over a year ago

I think that's what you are looking for? If it's confusing, let me know and I'll add comments.

begin.py Over a year ago

Looks exactly like what I need. However, my output looks like this: (u'Goalkeepers', u'Cech', u'Chelsea', u'30', u'\xa36.4'). Is this because I am missing a Python plugin?

Foo Bar User Over a year ago

I did it in Python 3.3. Python 2.x deals with strings/unicode in a different way. The u prefix stands for unicode.
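
To illustrate the reply above: `\xa3` is just the escaped form of the pound sign (code point U+00A3), so the data is intact and only its printed representation differs. A small Python 3 sketch; the `ascii()` call is used here only to mimic how Python 2 displays non-ASCII characters inside a tuple:

```python
price = "\xa36.4"  # the same string as "£6.4"

print(price == "£6.4")   # the escape and the literal character are equal
print(ascii(price))      # escaped form, like Python 2's repr minus the u prefix
print(float(price[1:]))  # drop the currency sign to get a number
```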

begin.py Over a year ago

let us continue this discussion in chat
