Here is what the table looks like on the web page (it's just one column):

Here is the HTML of the table I am trying to scrape:

If it matters, that table is nested within another table.

Here is my code:

    def filter_changed_records():
        # Scrape webpage for addresses from table of changed properties
        row_number = 0
        results_frame = locate_element(
            '//*[@id="oGridFrame"]'
        )
        driver.switch_to.frame(results_frame)
        while True:
            try:
                address = locate_element("id('row" + str(row_number) +
                                         "FC')/x:td")
                print(address)
                changed_addresses.append(address)
                row_number += 1
            except:
                print("No more addresses to add.")
                break

As you can see, there is a <tr> tag with an id of row0FC. This table is dynamically generated, and each new <tr> gets an id with an increasing number: row0FC, row1FC, row2FC, etc. That is how I planned to iterate through all the entries and add them to a list.

My locate_element function is the following:

    def locate_element(path):
        element = WebDriverWait(driver, 50).until(
            EC.presence_of_element_located((By.XPATH, path)))
        return element

It always times out after 50 seconds without finding the element. I'm unsure how to proceed. Is there a better way of locating the element?

SOLUTION BY ANDERSSON

    address = locate_element("//tr[@id='row%sFC']/td" % row_number).text

2 Answers

Your XPath seems to be incorrect.

Try below:

    address = locate_element("//tr[@id='row%sFC']/td" % row_number)

Also note that address is a WebElement. If you want to get its text content, you should use

    address = locate_element("//tr[@id='row%sFC']/td" % row_number).text
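
For reference, here is how the corrected locator could slot back into the question's loop. This is only a sketch: it reuses the driver, locate_element and changed_addresses names from the question, and catches the TimeoutException that WebDriverWait raises instead of using a bare except:

    from selenium.common.exceptions import TimeoutException

    def filter_changed_records():
        # Scrape webpage for addresses from table of changed properties
        results_frame = locate_element('//*[@id="oGridFrame"]')
        driver.switch_to.frame(results_frame)
        row_number = 0
        while True:
            try:
                address = locate_element(
                    "//tr[@id='row%sFC']/td" % row_number).text
                changed_addresses.append(address)
                row_number += 1
            except TimeoutException:
                # no row with this id: we've read the whole table
                print("No more addresses to add.")
                break

Note that the final iteration still waits the full 50 seconds before the TimeoutException fires, so a shorter timeout inside the loop may be worth considering.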

12 Comments

No luck, sadly. It still can't find it. Does the XPath also need to route through the parent table, or would that not affect it?
Can you check whether your table is located inside an iframe? Also, add the HTML for it as text, not as an image.
It is in a frame. I left that part out; I've edited the original code in the post to include those lines.
Did you miss return element in your locate_element() definition?
Somehow that bit got cut off. Yeah, I have that. locate_element() works fine for tons of other stuff, so that bit isn't the issue.

Parsing HTML with Selenium is slow. I would use BeautifulSoup for that.

Assuming you have loaded the page in driver, it would be something like:

    from bs4 import BeautifulSoup
    # ...

    # Parse the page that Selenium has already loaded
    soup = BeautifulSoup(driver.page_source, "html.parser")
    for td in soup.find_all('td'):
        try:
            # this assumes the address sits in each cell's title attribute
            addr = td['title']
            print(addr)
        except KeyError:
            # skip cells that have no title attribute
            pass
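
One caveat worth flagging, based on the comments under the accepted answer: the question's table lives inside an iframe, and driver.page_source returns the document of whichever frame the driver has currently switched into. So, assuming the locate_element helper from the question, you would switch into the frame before handing the source to soup:

    # Switch into the grid's iframe first so that page_source returns the
    # frame's document rather than the top-level page.
    results_frame = locate_element('//*[@id="oGridFrame"]')
    driver.switch_to.frame(results_frame)
    soup = BeautifulSoup(driver.page_source, "html.parser")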

5 Comments

Is the difference in speed big enough to justify migrating my entire script to it? It's about 500 lines of Selenium, so I don't want to spend the time switching to BeautifulSoup if it isn't a huge difference.
That depends on the number of pages you're getting info from and how many elements you're using Selenium to grab. If it's a one-off and time is not important, stick to Selenium. In future projects, parse the code with something else if speed is important...
I just did a speed test. Setup was as follows: I used Selenium to pull data from the white pages, one page with 100 hits, where each hit is a result block holding a name, an address and a phone number. I did 10 loops each for Selenium and BeautifulSoup (html.parser), extracting name, address and phone number from every hit (3 find commands per hit), which for each of them sums to 3010 find commands in total (10 loops * 100 persons * 3, plus 10 * 1 find-the-result-block commands). With soup the total time is 13 seconds; with Selenium it is 165 seconds, which makes soup about 12 times faster.
Wow. That's awesome. I think I'll migrate
And if you feed soup only the part of the HTML that holds your data (which I did in the example), your speed improves compared to what I did in my answer above (where I fed soup driver.page_source). This is done like so: container = driver.find_element_by_css_selector('div.relevant.section').get_attribute("outerHTML"), and then you feed only the container object to BeautifulSoup instead of the whole HTML page.
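
A short sketch of that last tip, where 'div.relevant.section' is the commenter's placeholder selector rather than the question's actual markup:

    # Grab only the container that holds the data and parse just that HTML.
    container = driver.find_element_by_css_selector(
        'div.relevant.section').get_attribute("outerHTML")
    soup = BeautifulSoup(container, "html.parser")
    # Collect the title attributes, as in the answer above.
    addresses = [td['title'] for td in soup.find_all('td')
                 if td.has_attr('title')]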
