Extract information from website using Xpath, Python

Question

I'm trying to extract some useful information from a website. I've made some progress, but now I'm stuck and need your help!

I need the information from this table:

http://gbgfotboll.se/serier/?scr=scorers&ftid=57700

I wrote this code and got the information that I wanted:

import lxml.html
from lxml.etree import XPath

url = ("http://gbgfotboll.se/serier/?scr=scorers&ftid=57700")

rows_xpath = XPath("//*[@id='content-primary']/div[1]/table/tbody/tr")
name_xpath = XPath("td[1]//text()")
team_xpath = XPath("td[2]//text()")

league_xpath = XPath("//*[@id='content-primary']/h1//text()")


html = lxml.html.parse(url)

divName = league_xpath(html)[0]

for id,row in enumerate(rows_xpath(html)):
    scorername = name_xpath(row)[0]
    team = team_xpath(row)[0]
    print scorername, team


print divName

I get this error:

    scorername = name_xpath(row)[0]
IndexError: list index out of range

I understand why I get the error. What I really need help with is that I only need the first 12 rows. This is what the extraction should do in these three possible scenarios:

If there are fewer than 12 rows: take all the rows except THE LAST ROW.

If there are exactly 12 rows: same as above.

If there are more than 12 rows: simply take the first 12 rows.

How can I do this?

EDIT1

It is not a duplicate. Sure, it is the same site, but I have already done what that question asked for, which was to get all the values from a row. I don't need the last row, and I don't want to extract more than 12 rows if there are more.

Possible duplicate of Extracting information from a table on a website using python, LXML & XPATH
– felipsmartins, Commented Apr 12, 2015 at 23:28

marc_s · Accepted Answer · 2017-03-18 00:08:33Z

1

I think this is what you want:

#coding: utf-8
from lxml import etree
import lxml.html

collected = []  # one list of column values per table row: [[col1, col2, ...], ...]
dom = lxml.html.parse("http://gbgfotboll.se/serier/?scr=scorers&ftid=57700")
#all table rows
xpatheval = etree.XPathDocumentEvaluator(dom)
rows = xpatheval('//div[@id="content-primary"]/div/table[1]/tbody/tr')
# If there are 12 rows or fewer: take all the rows except the last.
if len(rows) <= 12:
    rows = rows[:-1]  # slicing, unlike pop(), is safe on an empty list
else:
    # If there are more than 12 rows: simply take the first 12 rows.
    rows = rows[:12]

for row in rows:
    # all columns of current table row (Spelare, Lag, Mal, straffmal)
    columns = row.findall("td")
    # pick textual data from each <td>
    collected.append([column.text for column in columns])

for i in collected:
    print i
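One caveat with `column.text`: it holds only the text that comes before the first child element, so a cell whose value is wrapped in a link comes back as `None`. Below is a minimal sketch of a workaround. It assumes made-up row markup (the real page may differ) and uses the standard library's `xml.etree.ElementTree` instead of lxml so that it runs without network access; lxml elements expose the same `.text` and `.itertext()`. The names `SAMPLE_ROW` and `cell_text` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Made-up markup for illustration only; the real table may be structured differently.
SAMPLE_ROW = "<tr><td><a href='#'>Player A</a></td><td>Team X</td><td>7</td></tr>"


def cell_text(cell):
    """Join every text fragment inside a cell, including text in nested tags."""
    return "".join(cell.itertext()).strip()


row = ET.fromstring(SAMPLE_ROW)
cells = row.findall("td")

# .text misses the first cell's value because it sits inside the <a> tag.
plain = [cell.text for cell in cells]
# itertext() walks nested elements too, so the linked name is picked up.
joined = [cell_text(cell) for cell in cells]

print(plain)
print(joined)
```

If every cell on the page holds bare text, the `column.text` version above is enough and this helper is unnecessary.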


edited Mar 18, 2017 at 0:08

marc_s


answered Apr 13, 2015 at 0:27

felipsmartins


1 Comment

AppDev Over a year ago

Absolutely perfect. I applied the techniques you used to my own code. Thank you so much!

Joe T. Boka · Accepted Answer · 2015-04-12 23:15:44Z

0

This is how you can get the rows you need, based on what you described in your post. This is just the logic, assuming that rows is a list; you will have to incorporate it into your code as needed.

if len(rows) <=12:
    print rows[0:-1]
elif len(rows) > 12:
    print rows[0:12]

answered Apr 12, 2015 at 23:15

Joe T. Boka


3 Comments

AppDev Over a year ago

But it is just printing out elements? I don't see how I can access the individual elements like I do in my code.

Joe T. Boka Over a year ago

@AppDev I put print there, but you can do anything you need to with this. This answers the question in your post: "If there are less than 12 rows: Take all the rows except THE LAST ROW. If there are 12 rows: same as above. If there are more than 12 rows: Simply take the first 12 rows. How can I do this?"

Joe T. Boka Over a year ago

@AppDev Instead of print, you can assign the slice to a variable such as x, like so: x = rows[0:-1] or x = rows[0:12]. Then you can iterate through x and access the individual elements.
