Scrape a web page's contents using Python/selenium

Question

I'm trying to scrape the contents of a table. I believe the table is rendered by JavaScript, so I'm using the selenium package with Python 3. I've seen others locate a table by its XPath in order to scrape its contents, but I'm not sure how to identify the correct XPath.

How can I extract the table's contents? If an XPath is the way to go, how do I identify the correct XPath(s) for the table or its contents by inspecting the web page's source?

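from io import StringIO

import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By

# Selenium 3-style constructor; Selenium 4 takes a Service object instead
driver = webdriver.Chrome('path/to/chromedriver.exe')
url = 'https://ultrasignup.com/results_event.aspx?did=6727'
driver.get(url)

# Now I need to get the table's contents. I might do something like this:
table = driver.find_element(By.XPATH, 'my_xpath')  # placeholder XPath; returns one element, not a list
table_html = table.get_attribute('outerHTML')  # outerHTML keeps the <table> tag that read_html looks for
df = pd.read_html(StringIO(table_html))[0]
print(df)
driver.quit()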

I believe there is no need to scrape, because they have an API. If you visit this link you will see nicely formatted data from the table you provided: ultrasignup.com/service/events.svc/results/6727/json?rows=1500
– andreilozhkin, Commented Jun 23, 2019 at 18:28
The page-under-test has many page elements with id attributes. Locating via id will be less fragile; YMMV.
– orde, Commented Jun 23, 2019 at 18:33
@andreilozhkin you began to post some code that looked helpful, but then removed it. I could accept your answer if you put it back up!
– twb10, Commented Jun 23, 2019 at 19:27

andreilozhkin · Accepted Answer · 2019-06-23 19:02:36Z

1

I believe there is no need to scrape, because they have an API.

If you visit this link you will see nicely formatted data from the table you provided: https://ultrasignup.com/service/events.svc/results/6727/json

Some code:

import requests

url = 'https://ultrasignup.com/service/events.svc/results/6727/json'

response = requests.get(url)
response.raise_for_status()  # stop here if the request failed

# Get all people from the table
people = list(response.json())

# Print first person's information
print(people[0])
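
Once the records are parsed, they can be written out as a table. The following is only a sketch using the standard library: `SAMPLE_JSON`, its field names (`place`, `name`, `time`), and the helper `records_to_csv` are hypothetical placeholders, not the real schema returned by the service. It assumes the response is a list of flat dictionaries, as the code above implies.

```python
import csv
import io
import json

# Hypothetical payload: the field names are placeholders, not the real API schema.
SAMPLE_JSON = (
    '[{"place": 1, "name": "Runner A", "time": "4:05:10"},'
    ' {"place": 2, "name": "Runner B", "time": "4:12:43"}]'
)


def records_to_csv(records):
    """Serialize a list of flat dicts to CSV text, one column per distinct key."""
    columns = []
    for record in records:
        for key in record:
            if key not in columns:
                columns.append(key)  # keep first-seen column order
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)  # missing keys are written as empty cells
    return buffer.getvalue()


people = json.loads(SAMPLE_JSON)  # stands in for response.json()
csv_text = records_to_csv(people)
print(csv_text)
```

With the real response, `people` would come from `response.json()` instead of the sample string, and the columns would be whatever keys the service actually returns.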

Hope it helps!

answered Jun 23, 2019 at 19:02

andreilozhkin


THE YOGOVO · Answer · 2019-06-23 18:59:52Z

0

You can identify the correct XPath by inspecting the table's elements and looking at the page source. Once you see which tags contain the table content, build your XPath step by step, one tag at a time.

For example:


<div class="test">
<p class="test2">
<table class="test3"> 
<!--May have more attributes-->
contents...
</table>
</p>
</div>

Then you begin your XPath with //div[@class="test"]. Now you are inside the div.

Next step: //div[@class="test"]//p[@class="test2"]. Now you are inside the paragraph tag.

Final step:

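xpath = "//div[@class='test']//p[@class='test2']//table[@class='test3']"

# Requires: from selenium.webdriver.common.by import By
# Pass the variable (not the string 'xpath'); find_element returns one element
table = driver.find_element(By.XPATH, xpath)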

Now you can access the table element and read its attributes or its contents, for example with table.text or table.get_attribute('outerHTML').
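
The step-wise approach can be tried offline before involving a browser. The sketch below uses Python's standard `xml.etree.ElementTree`, which supports only a subset of XPath (Selenium itself uses the browser's full XPath engine), and it needs well-formed markup. The sample document reuses the class names from the example above; the `<html><body>` wrapper, the table rows, and the runner names are made-up placeholders.

```python
import xml.etree.ElementTree as ET

# The example markup from above, wrapped in <html><body> and given
# hypothetical rows so there is something to extract.
SAMPLE = """
<html><body>
<div class="test">
  <p class="test2">
    <table class="test3">
      <tr><th>Place</th><th>Name</th></tr>
      <tr><td>1</td><td>Runner A</td></tr>
      <tr><td>2</td><td>Runner B</td></tr>
    </table>
  </p>
</div>
</body></html>
""".strip()

root = ET.fromstring(SAMPLE)

# Build the path one step at a time, checking what each step matches.
# ElementTree paths are relative, hence the leading "." before "//".
steps = [
    ".//div[@class='test']",
    ".//div[@class='test']//p[@class='test2']",
    ".//div[@class='test']//p[@class='test2']//table[@class='test3']",
]
matched_tags = [root.find(step).tag for step in steps]
print(matched_tags)

# With the table located, collect the text of every cell, row by row.
table = root.find(steps[-1])
rows = [[cell.text for cell in tr] for tr in table.iter("tr")]
print(rows)
```

Real pages are rarely well-formed XML, so this is a way to practise building the path rather than a replacement for locating the element through Selenium.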

answered Jun 23, 2019 at 18:59

THE YOGOVO


1 Comment

twb10 Over a year ago

Thanks YOGOVO, this begins to help me better understand the structure of the HTML source code. Would you be able to identify example XPaths based on the web page I provided? I am still struggling to identify the correct tags from the source code.
