This is a complete rewrite of the question. Judging by the answers, I must have asked it poorly, so I will try to be clearer.

I have an object that I am trying to scrape. The code works without problems on my laptop. When I transferred it to PythonAnywhere, I could no longer get the information I am looking for.

The code that works on my system is:

from urllib.request import urlopen
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
import csv
import time
import re

#68 lines of code for another section of the site above this working well on my system and on pythonanywhere.

pageSource = driver.page_source
bsObj = BeautifulSoup(pageSource)

try:
    parcel_number = bsObj.find(id="mParcelnumbersitusaddress_mParcelNumber")
    s_parcel_number = parcel_number.get_text()
except AttributeError as e:
    s_parcel_number = "Parcel Number not found"

# same kind of code (all working) that gets 10 more pieces of data

# Tax Year
try:
    pause = WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.ID, "TaxesBalancePaymentCalculator")))
    taxes_owed_2015_yr = bsObj.findAll(id="mGrid_RealDataGrid")[1].findAll('tr')[1].findAll('td')[0]
except IndexError as e:
    s_taxes_owed_2015_yr = "No taxes due"

This code works just fine on my laptop with Firefox. On PythonAnywhere, if I print the page source for the page I am trying to scrape, I get the following where my table should be:

<table border="0" cellpadding="5" cellspacing="0" class="WithBorder" width="100%">
<tbody><tr>
<td id="TaxesBalancePaymentCalculator"><!--DONT_PRINT_START-->
<span class="InputFieldTitle" id="mTabGroup_Taxes_mTaxChargesBalancePaymentInjected_mReportProcessingNote">Please wait while your current taxes are calculated.</span><img src="images/progress.gif"/> <!--DONT_PRINT_FINISH--></td>
</tr> <!--DONT_PRINT_START-->
<script type="text/javascript">
                                function TaxesBalancePaymentCalculator_ScriptLoaded( pPageContent )
                                {
                                    element('TaxesBalancePaymentCalculator').innerHTML = pPageContent;
                                }
                                function results_ready()
                                {
                                    element('pay_button_area').style.display = 'block';
                                    element('pay_button_area2').style.display = 'block';
                                    element('pay_additional_things_area').style.display = 'block';
                                }
                                var no_taxes_calculator = '&amp;nbsp;&lt;' + 'span class="MessageTitle"&gt;The tax balance calculator is not available.&lt;' + '/span&gt;';
                                function no_taxes_calculator_available()
                                {
                                    element('TaxesBalancePaymentCalculator').innerHTML = no_taxes_calculator;
                                }
                                function invalid()
                                {
                                    element('TaxesBalancePaymentCalculator').innerHTML = no_taxes_calculator;
                                }
                                loadScript( 'injected/TaxesBalancePaymentCalculator.aspx?parcel_number=15-720-01-01-00-0-00-000' );
                                </script><script id="injected_taxesbalancepaymentcalculator_ScriptTag" type="text/javascript"></script>
<tr id="pay_button_area" style="DISPLAY: none">
<td id="pay_button_area2">
<table border="0" cellpadding="2" cellspacing="0">
<tbody><tr>

I have played around and have found that if I get the innerHTML (as a str):

element('TaxesBalancePaymentCalculator').innerHTML = pPageContent;

that section holds my data. The problem is that I cannot perform a findAll on a string, and I need certain rows from the table:

taxes_owed_2015_yr = bsObj.findAll(id="mGrid_RealDataGrid")[1].findAll('tr')[1].findAll('td')[0]

I need help getting that element as an object (not a string) so that I can use it in my data. I have tried so many things that I could not list them all here. I could really use some help, please.

Thanks in advance.

  • I don't remember any findAll method in Python itself; this is a bs4 method. Did you import bs4 in your code? What are you trying to do with bsObj? Commented Dec 15, 2015 at 14:08
  • Yes, it is a bs4 method, and I imported bs4 a couple of hundred lines higher. I am trying to get the information out of the table that is in the innerHTML. Commented Dec 15, 2015 at 14:31
  • According to the docs, driver.get_attribute returns a string, hence the error. Commented Dec 15, 2015 at 14:57
  • @Raymond, I'm afraid the bs4 module works a little differently. You should read more about it: crummy.com/software/BeautifulSoup/bs4/doc Commented Dec 15, 2015 at 15:11

3 Answers

I think it might be a page-loading speed difference. At the start of your code, you have

pageSource = driver.page_source
bsObj = BeautifulSoup(pageSource)

So, you're creating your BeautifulSoup object based on the contents of the page at that point. Later on, you're doing this:

pause = WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.ID, "TaxesBalancePaymentCalculator")))
taxes_owed_2015_yr = bsObj.findAll(id="mGrid_RealDataGrid")[1].findAll('tr')[1].findAll('td')[0]

So, you're telling WebDriver to wait until something has appeared, and then making a query to the BeautifulSoup object that you created earlier. But the BeautifulSoup object still has the page source from the start of your script -- not the new page source with the object that you waited for.

Try re-creating the bsObj based on the new page source after you've done the wait.
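
A minimal sketch of that suggestion, assuming only that `driver` exposes a `page_source` attribute, as a Selenium WebDriver does. The helper name `wait_for_soup` and its `min_count`, `timeout`, and `poll` parameters are hypothetical and not part of Selenium or BeautifulSoup. It re-parses the page on every poll, so the returned soup always reflects the current page. Note that in the page source shown in the question, the `TaxesBalancePaymentCalculator` cell already exists while it still shows the "Please wait" message, so polling for the table id `mGrid_RealDataGrid` may be a more reliable signal than waiting for the cell itself.

```python
import time

from bs4 import BeautifulSoup


def wait_for_soup(driver, element_id, min_count=1, timeout=20.0, poll=0.5):
    """Hypothetical helper: re-parse driver.page_source until at least
    min_count elements with the given id exist, then return the fresh soup."""
    deadline = time.monotonic() + timeout
    while True:
        # Parse the page as it is right now, not as it was earlier.
        soup = BeautifulSoup(driver.page_source, "html.parser")
        if len(soup.find_all(id=element_id)) >= min_count:
            return soup
        if time.monotonic() >= deadline:
            raise TimeoutError(
                "fewer than %d elements with id %r after %s seconds"
                % (min_count, element_id, timeout)
            )
        time.sleep(poll)
```

Since the question indexes the second `mGrid_RealDataGrid` table, a call such as `bsObj = wait_for_soup(driver, "mGrid_RealDataGrid", min_count=2)` would replace the earlier `bsObj` before the `findAll` lookups run.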

As @Steve pointed out in the comments, get_attribute returns a string, not an HTML element. Try replacing that line with one of the find_element_by_* methods. You can read more in the docs: http://selenium-python.readthedocs.org/api.html#selenium.webdriver.remote.webelement.WebElement.find_element_by_tag_name

Besides that, you are using BeautifulSoup the wrong way. You need to create your bs4 object by passing the HTML as a parameter, and then call findAll on that object:

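# html_as_plain_text is the page HTML as a str, e.g. driver.page_source
soup = BeautifulSoup(html_as_plain_text, "html.parser")
for element in soup.findAll(id="mGrid_RealDataGrid"):
    print(element.get_text())  # do your thing with each matching element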

From what I see in the code, you want to get the innerHTML of an element and feed it to BeautifulSoup for further parsing. First of all, you probably need outerHTML to get the element itself in the resulting HTML and, also, most importantly, you need to initialize the "soup" object:

from bs4 import BeautifulSoup

demo_div = driver.find_element_by_id('TaxesBalancePaymentCalculator')
demo_html = demo_div.get_attribute('outerHTML')

soup = BeautifulSoup(demo_html, "html.parser")  # < YOU ARE MISSING THIS PART
s_taxes_owed_2015_yr = soup.find_all(id="mGrid_RealDataGrid")[1].find_all('tr')[1].find_all('td')[0].get_text()
print(s_taxes_owed_2015_yr)
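
To illustrate the innerHTML versus outerHTML point without a browser, here is a small sketch that uses stand-in markup. The two strings are assumptions modeled on the ids in the question, not the real page content. It shows that the wrapper element can only be found when the outerHTML is parsed, while the nested table becomes a searchable object in both cases.

```python
from bs4 import BeautifulSoup

# Stand-in markup (assumed); on the real page these strings would come from
# get_attribute('outerHTML') and get_attribute('innerHTML').
inner_html = '<table id="mGrid_RealDataGrid"><tr><td>2015</td></tr></table>'
outer_html = '<td id="TaxesBalancePaymentCalculator">' + inner_html + '</td>'

outer_soup = BeautifulSoup(outer_html, "html.parser")
inner_soup = BeautifulSoup(inner_html, "html.parser")

# The wrapper element itself is only present in the outerHTML parse.
wrapper_in_outer = outer_soup.find(id="TaxesBalancePaymentCalculator")
wrapper_in_inner = inner_soup.find(id="TaxesBalancePaymentCalculator")

# Either way, parsing turns the string into objects that support find/find_all.
cell_text = inner_soup.find(id="mGrid_RealDataGrid").find("td").get_text()
```

If only the table rows are needed, parsing the innerHTML is enough; the outerHTML matters when later lookups refer to the wrapper's own id.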

2 Comments

That looked good, but I still get an element out of limit error because the table never loads in the PythonAnywhere Firefox browser.
@Raymond and that is a separate problem. Let's avoid fixing multiple issues in a single topic. Please consider creating a separate question with details if you need help. Thanks.
