Python: parse specific data from an HTML table using lxml and XPath

Question

First of all I am new to python and Stack Overflow so please be kind.

This is the source code of the html page I want to extract data from.

Webpage: http://gbgfotboll.se/information/?scr=table&ftid=51168 The table is at the bottom of the page

  <html>
        <table class="clCommonGrid" cellspacing="0">
                <thead>
                    <tr>
                        <td colspan="3">Kommande matcher</td>
                    </tr>
                    <tr>
                        <th style="width:1%;">Tid</th>
                        <th style="width:69%;">Match</th>
                        <th style="width:30%;">Arena</th>
                    </tr>
                </thead>

                <tbody class="clGrid">

            <tr class="clTrOdd">
                <td nowrap="nowrap" class="no-line-through">
                    <span class="matchTid"><span>2014-09-26<!-- br ok --> 19:30</span></span>



                </td>
                <td><a href="?scr=result&amp;fmid=2669197">Guldhedens IK - IF Warta</a></td>
                <td><a href="?scr=venue&amp;faid=847">Guldheden Södra 1 Konstgräs</a> </td>
            </tr>

            <tr class="clTrEven">
                <td nowrap="nowrap" class="no-line-through">
                    <span class="matchTid"><span>2014-09-26<!-- br ok --> 13:00</span></span>



                </td>
                <td><a href="?scr=result&amp;fmid=2669176">Romelanda UF - IK Virgo</a></td>
                <td><a href="?scr=venue&amp;faid=941">Romevi 1 Gräs</a> </td>
            </tr>

            <tr class="clTrOdd">
            <td nowrap="nowrap" class="no-line-through">
                <span class="matchTid"><span>2014-09-27<!-- br ok --> 13:00</span></span>



            </td>
            <td><a href="?scr=result&amp;fmid=2669167">Kode IF - IK Kongahälla</a></td>
            <td><a href="?scr=venue&amp;faid=912">Kode IP 1 Gräs</a> </td>
        </tr>

        <tr class="clTrEven">
            <td nowrap="nowrap" class="no-line-through">
                <span class="matchTid"><span>2014-09-27<!-- br ok --> 14:00</span></span>



            </td>
            <td><a href="?scr=result&amp;fmid=2669147">Floda BoIF - Partille IF FK </a></td>
            <td><a href="?scr=venue&amp;faid=218">Flodala IP 1</a> </td>
        </tr>


                </tbody>
        </table>
    </html>

I need to extract the time (19:30) and the team names (Guldhedens IK - IF Warta), that is, the first and second table cells (not the third) of the first table row, then 13:00 and Romelanda UF - IK Virgo from the second table row, and so on for every row in the table.

As you can see, every table row has a date right before the time, so here comes the tricky part: I only want the time and team names from those rows where the date equals the date on which I run the code.

I have not managed much so far. I can only get the date and time of the first row using this code:

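import lxml.html

# Parse the page directly from the URL
html = lxml.html.parse("http://gbgfotboll.se/information/?scr=table&ftid=51168")
# Text nodes of the first cell in the first row of the third table
test = html.xpath("//*[@id='content-primary']/table[3]/tbody/tr[1]/td[1]/span/span//text()")

print(test)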

This gives me the result ['2014-09-26', ' 19:30']. After this I am lost: how do I iterate through the table rows and pick out the specific cells only where the date matches the date on which I run the code?
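
For context, the two-element result appears to come from the HTML comment inside the inner span, which splits the span's text into two text nodes (the date before the comment, the time after it). A minimal offline sketch of that behaviour, assuming a hand-copied fragment wrapped in a made-up `<div>` rather than the live page; the names `fragment` and `time_of_day` are only for illustration:

```python
import lxml.html

# Hypothetical fragment: the spans of one cell, wrapped in a <div> for parsing.
fragment = lxml.html.fromstring(
    '<div><span class="matchTid">'
    '<span>2014-09-26<!-- br ok --> 19:30</span>'
    '</span></div>'
)

# The comment sits between two text nodes, so text() returns two strings.
parts = fragment.xpath("span/span//text()")
date = parts[0].strip()
time_of_day = parts[1].strip()

print(parts)
print(date, time_of_day)
```

So index 0 is the date and index 1 is the time (with a leading space that `strip()` removes).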

I hope you can answer as much as you can.

CodeNinja · Accepted Answer · 2014-09-21 17:42:40Z

If I understood you, try something like this:

import lxml.html

url = "http://gbgfotboll.se/information/?scr=table&ftid=51168"
html = lxml.html.parse(url)
for i in range(12):  # assumes the table has exactly 12 rows
    # XPath positions are 1-based, hence i + 1
    xpath1 = ".//*[@id='content-primary']/table[3]/tbody/tr[%d]/td[1]/span/span//text()" % (i + 1)
    xpath2 = ".//*[@id='content-primary']/table[3]/tbody/tr[%d]/td[2]/a/text()" % (i + 1)
    # xpath1 yields [date, time]; index 1 is the time
    print(html.xpath(xpath1)[1], html.xpath(xpath2)[0])

I know this is fragile and there are better solutions, but it works. ;)
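
One way to avoid the hard-coded row count is to let XPath return the rows and loop over whatever comes back. The following is only a sketch under assumptions: it runs against a trimmed, hand-copied `SAMPLE` of the table (first two cells only) instead of the live page, and `extract_matches` is a made-up helper name.

```python
import lxml.html

# Trimmed copy of the table from the question; not fetched from the site.
SAMPLE = """
<table class="clCommonGrid">
  <thead><tr><th>Tid</th><th>Match</th></tr></thead>
  <tbody class="clGrid">
    <tr class="clTrOdd">
      <td><span class="matchTid"><span>2014-09-26<!-- br ok --> 19:30</span></span></td>
      <td><a href="?scr=result&amp;fmid=2669197">Guldhedens IK - IF Warta</a></td>
    </tr>
    <tr class="clTrEven">
      <td><span class="matchTid"><span>2014-09-26<!-- br ok --> 13:00</span></span></td>
      <td><a href="?scr=result&amp;fmid=2669176">Romelanda UF - IK Virgo</a></td>
    </tr>
    <tr class="clTrEven">
      <td><span class="matchTid"><span>2014-09-27<!-- br ok --> 14:00</span></span></td>
      <td><a href="?scr=result&amp;fmid=2669147">Floda BoIF - Partille IF FK </a></td>
    </tr>
  </tbody>
</table>
"""


def extract_matches(table):
    """Return (date, time, teams) for every body row that has both cells."""
    matches = []
    for row in table.xpath(".//tbody/tr"):
        stamp = row.xpath("td[1]/span/span//text()")  # [date, time]
        teams = row.xpath("td[2]/a/text()")
        if len(stamp) < 2 or not teams:
            continue  # skip rows that do not look like a match row
        matches.append((stamp[0].strip(), stamp[1].strip(), teams[0].strip()))
    return matches


table = lxml.html.fromstring(SAMPLE)
for date, time_of_day, teams in extract_matches(table):
    print(date, time_of_day, teams)
```

Because the XPath expressions inside the loop are relative to each row, the loop works for any number of rows.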

Edit:
A better way, using BeautifulSoup:

from bs4 import BeautifulSoup
import requests

respond = requests.get("http://gbgfotboll.se/information/?scr=table&ftid=51168")
soup = BeautifulSoup(respond.text, "html.parser")  # name the parser explicitly
l = soup.find_all('table')
t = l[2].find_all('tr')  # change this to [0] to parse the first table
for i in t:
    try:
        # the last five characters of the span text are the time (HH:MM)
        print(i.find('span').get_text()[-5:], i.find('a').get_text())
    except AttributeError:
        pass  # rows without a <span> or <a>, such as header rows

Edit 2: the page is not responding right now, but this should work:

from bs4 import BeautifulSoup
import requests

respond = requests.get("http://gbgfotboll.se/information/?scr=table&ftid=51168")
soup = BeautifulSoup(respond.text, "html.parser")  # name the parser explicitly
l = soup.find_all('table')
t = l[2].find_all('tr')
lastDate = ""  # date part of the previously printed row
for i in t:
    try:
        dateTime = i.find('span').get_text()
        teamName = i.find('a').get_text()
        if lastDate == dateTime[:-5]:
            # same date as the previous row: print only the time
            print(dateTime[-5:], teamName)
        else:
            print(dateTime, teamName)
            lastDate = dateTime[:-5]
    except AttributeError:
        pass  # rows without a <span> or <a>, such as header rows

The same with lxml:

import lxml.html

url = "http://gbgfotboll.se/information/?scr=table&ftid=51168"
html = lxml.html.parse(url)
dateTemp = ""  # date of the previously printed row
for i in range(12):
    xpath1 = ".//*[@id='content-primary']/table[3]/tbody/tr[%d]/td[1]/span/span//text()" % (i + 1)
    xpath2 = ".//*[@id='content-primary']/table[3]/tbody/tr[%d]/td[2]/a/text()" % (i + 1)
    matchTime = html.xpath(xpath1)[1]
    date = html.xpath(xpath1)[0]
    teamName = html.xpath(xpath2)[0]
    if date == dateTemp:
        print(matchTime, teamName)
    else:
        print(date, matchTime, teamName)
        dateTemp = date  # remember the date so it is printed only once
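
The "print the date only once" idea can also be written with `itertools.groupby` once the rows have been extracted. This is a rough sketch, not part of the original answer: the `rows` list is hypothetical sample data in (date, time, teams) form, and `format_grouped` is a made-up helper name.

```python
from itertools import groupby

# Hypothetical rows, e.g. collected from the XPath results, in page order.
rows = [
    ("2014-09-26", "19:30", "Guldhedens IK - IF Warta"),
    ("2014-09-26", "13:00", "Romelanda UF - IK Virgo"),
    ("2014-09-27", "14:00", "Floda BoIF - Partille IF FK"),
]


def format_grouped(rows):
    """Return output lines with the date shown only on the first row of each run."""
    lines = []
    # groupby merges consecutive rows that share the same date
    for date, group in groupby(rows, key=lambda row: row[0]):
        for index, (_, time_of_day, teams) in enumerate(group):
            prefix = date + " " if index == 0 else ""
            lines.append(prefix + time_of_day + " " + teams)
    return lines


for line in format_grouped(rows):
    print(line)
```

Note that `groupby` only merges consecutive rows, which matches the loop above as long as the page lists matches in date order.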

edited Sep 21, 2014 at 17:42

answered Sep 21, 2014 at 14:44

5 Comments

user4063374 Over a year ago

Why is the first example fragile? You also missed one thing I need: it should only print the time and team names if the date matches. For example, say we run the code on 2014-09-26. While iterating through the table in the for loop, it should check whether the date in each table row, <span class="matchTid"><span>2014-09-26, matches the run date. The code would then print only 19:30 Guldhedens IK - IF Warta and 13:00 Romelanda UF - IK Virgo, because their date matches the run date 2014-09-26. @CodeNinja, could you add that to the first example?

user4063374 Over a year ago

I would vote this up if I could, but it says I need 15 reputation to vote. I am sorry, but I hope you will still help me if you can. @CodeNinja

user4063374 Over a year ago

You forgot to answer why it's fragile. Also, I have read that XPath and lxml are the better choice for parsing HTML pages. Could you also add the date solution to your first example using lxml? Thank you @CodeNinja

CodeNinja Over a year ago

Edited. I don't know lxml well ;) lxml has methods that let you do this better... It's fragile because I used range(12), which is lame ;p

user4063374 Over a year ago

Haha, OK, yeah, that's true, but I think I can fix it. I still don't think you really understood what I wanted to do, but I can tweak the code you gave me into what I want. I will use "import time" to get the current date. Anyway, I will post an answer once the site is up and I have tested the code, and I'll let you know when I have done that. Thank you for your help! :D @CodeNinja

Community · Accepted Answer · 2017-05-23 11:46:33Z

Thanks to @CodeNinja's help, I just tweaked it a little to get exactly what I wanted. I imported time to get the date on which the code runs. Here is the code for what I wanted. Thank you for the help!

import lxml.html
import time

url = "http://gbgfotboll.se/information/?scr=table&ftid=51168"
html = lxml.html.parse(url)
currentDate = time.strftime("%Y-%m-%d")  # today's date, formatted like the page
for i in range(12):
    xpath1 = ".//*[@id='content-primary']/table[3]/tbody/tr[%d]/td[1]/span/span//text()" % (i + 1)
    xpath2 = ".//*[@id='content-primary']/table[3]/tbody/tr[%d]/td[2]/a/text()" % (i + 1)
    # named matchTime so the time module is not shadowed
    matchTime = html.xpath(xpath1)[1]
    date = html.xpath(xpath1)[0]
    teamName = html.xpath(xpath2)[0]
    if date == currentDate:
        print(matchTime, teamName)
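
The date comparison itself can be tried without the website. A small sketch, assuming the rows have already been extracted as (date, time, teams) tuples; the `rows` data and the `matches_on` helper are hypothetical, and `datetime.date.isoformat()` is used here as an alternative to `time.strftime` because it produces the same YYYY-MM-DD layout:

```python
import datetime

# Hypothetical rows in (date, time, teams) form, as shown on the page.
rows = [
    ("2014-09-26", "19:30", "Guldhedens IK - IF Warta"),
    ("2014-09-26", "13:00", "Romelanda UF - IK Virgo"),
    ("2014-09-27", "14:00", "Floda BoIF - Partille IF FK"),
]


def matches_on(rows, day):
    """Keep (time, teams) for rows whose date string equals the given day."""
    wanted = day.isoformat()  # e.g. '2014-09-26'
    return [(match_time, teams) for (date, match_time, teams) in rows if date == wanted]


# On a real run this would be today's date.
today = datetime.date.today()
for match_time, teams in matches_on(rows, today):
    print(match_time, teams)
```

Passing the day in as a parameter makes it easy to check the filter with a fixed date such as `datetime.date(2014, 9, 26)`.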

Here is the final version, done the correct way. It goes through all the table rows without using range in the for loop. I got this answer from my other post: Iterate through all the rows in a table using python lxml xpath

import lxml.html
from lxml.etree import XPath

url = "http://gbgfotboll.se/information/?scr=table&ftid=51168"
date = '2014-09-27'

# rows whose first cell contains a text node equal to the date
rows_xpath = XPath("//*[@id='content-primary']/table[3]/tbody/tr[td[1]/span/span//text()='%s']" % (date))
# second text node of the inner span, i.e. the time after the comment
time_xpath = XPath("td[1]/span/span//text()[2]")
team_xpath = XPath("td[2]/a/text()")

html = lxml.html.parse(url)

for row in rows_xpath(html):
    time = time_xpath(row)[0].strip()
    team = team_xpath(row)[0]
    print(time, team)
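
As a variation, lxml lets a compiled XPath take variables, so the date can be passed in as `$date` at call time instead of being formatted into the expression string. The sketch below is an assumption-laden illustration: it runs on a trimmed, hand-copied `SAMPLE` table rather than the live page, uses a shorter relative row path, and `matches_for` is a made-up helper name.

```python
import lxml.html
from lxml.etree import XPath

# Trimmed copy of the table from the question; not fetched from the site.
SAMPLE = """
<table class="clCommonGrid">
  <tbody class="clGrid">
    <tr class="clTrOdd">
      <td><span class="matchTid"><span>2014-09-26<!-- br ok --> 19:30</span></span></td>
      <td><a href="?scr=result&amp;fmid=2669197">Guldhedens IK - IF Warta</a></td>
    </tr>
    <tr class="clTrEven">
      <td><span class="matchTid"><span>2014-09-26<!-- br ok --> 13:00</span></span></td>
      <td><a href="?scr=result&amp;fmid=2669176">Romelanda UF - IK Virgo</a></td>
    </tr>
    <tr class="clTrEven">
      <td><span class="matchTid"><span>2014-09-27<!-- br ok --> 14:00</span></span></td>
      <td><a href="?scr=result&amp;fmid=2669147">Floda BoIF - Partille IF FK </a></td>
    </tr>
  </tbody>
</table>
"""

# $date is an XPath variable, supplied as a keyword argument on each call.
rows_xpath = XPath(".//tbody/tr[td[1]/span/span//text() = $date]")
time_xpath = XPath("td[1]/span/span//text()[2]")
team_xpath = XPath("td[2]/a/text()")


def matches_for(table, date):
    """Return (time, teams) for every row whose date text equals date."""
    result = []
    for row in rows_xpath(table, date=date):
        result.append((time_xpath(row)[0].strip(), team_xpath(row)[0].strip()))
    return result


table = lxml.html.fromstring(SAMPLE)
for match_time, teams in matches_for(table, "2014-09-27"):
    print(match_time, teams)
```

The compiled expressions can then be reused for any date without rebuilding the XPath string.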
