Scraping with Python and Selenium - how should I return a 'null' if element not present

Question

Good day. I am new to Python and Selenium and have been searching for a solution for a while now. Some answers come close, but I can't seem to find one that solves my problem. The snippet of my code that causes the problem is as follows:

for url in links:
        driver.get(url)
        company = driver.find_elements_by_xpath("""//*[contains(@id, 'node')]/div[2]/ul/li/div/div[1]/span""")
        date = driver.find_elements_by_xpath("""//*[contains(@id, 'node')]/div[1]/div[1]/div[2]/div/span""")
        title = driver.find_elements_by_xpath("""//*[@id="page-title"]/span""")
        urlinf = driver.current_url #url info

        num_page_items = len(date)

        for i in range(num_page_items):
            df = df.append({'Company': company[i].text, 'Date': date[i].text, 'Title': title[i].text, 'URL': urlinf[i]}, ignore_index=True)

While this does work if all elements are present (and I can see the output to Pandas dataframe), if one of the elements doesn't exist (either 'date' or 'title') Python sends out the error:

IndexError: list index out of range

What I have tried thus far:

1) created a try/except (doesn't work)
2) tried if/else (if variable is not "")

I would like to insert "Null" if the element doesn't exist so that the Pandas dataframe populates with "Null" in the event an element doesn't exist.

Any assistance and guidance would be greatly appreciated.

EDIT 1:

I have tried the following:

for url in links:
    driver.get(url)
    try:
        company = driver.find_elements_by_xpath("""//*[contains(@id, 'node')]/div[2]/ul/li/div/div[1]/span""")
        date = driver.find_elements_by_xpath("""//*[contains(@id, 'node')]/div[1]/div[1]/div[2]/div/span""")
        title = driver.find_elements_by_xpath("""//*[@id="page-title"]/span""")
        urlinf = driver.current_url #url info
    except:
        pass
    num_page_items = len(date)

    for i in range(num_page_items):
        df = df.append({'Company': company[i].text, 'Date': date[i].text, 'Title': title[i].text, 'URL': urlinf[i]}, ignore_index=True)

and:

for url in links:
    driver.get(url)
    try:
        company = driver.find_elements_by_xpath("""//*[contains(@id, 'node')]/div[2]/ul/li/div/div[1]/span""")
        date = driver.find_elements_by_xpath("""//*[contains(@id, 'node')]/div[1]/div[1]/div[2]/div/span""")
        title = driver.find_elements_by_xpath("""//*[@id="page-title"]/span""")
        urlinf = driver.current_url #url info
    except (NoSuchElementException, ElementNotVisibleException, InvalidSelectorException):
        pass

    num_page_items = len(date)

    for i in range(num_page_items):
        df = df.append({'Company': company[i].text, 'Date': date[i].text, 'Title': title[i].text, 'URL': urlinf[i]}, ignore_index=True)

and:

for url in links:
    driver.get(url)
    try:
        company = driver.find_elements_by_xpath("""//*[contains(@id, 'node')]/div[2]/ul/li/div/div[1]/span""")
        date = driver.find_elements_by_xpath("""//*[contains(@id, 'node')]/div[1]/div[1]/div[2]/div/span""")
        title = driver.find_elements_by_xpath("""//*[@id="page-title"]/span""")
        urlinf = driver.current_url #url info
    except:
        i = 'Null'
        pass

    num_page_items = len(date)

    for i in range(num_page_items):
        df = df.append({'Company': company[i].text, 'Date': date[i].text, 'Title': title[i].text, 'URL': urlinf[i]}, ignore_index=True)

I tried the same try/except at the point of appending to Pandas.

EDIT 2: the error I get:

IndexError: list index out of range

is attributed to the line:

df = df.append({'Company': company[i].text, 'Date': date[i].text, 'Title': title[i].text, 'URL': urlinf[i]}, ignore_index=True)

Can you show your attempts with the try/except? That is the best way to handle errors and ignore them if needed.
– Moshe Slavin, Commented Nov 22, 2018 at 6:57
I've tried quite a few iterations and overwrote them when they didn't work, but I have added what I tried to my question.
– qbbq, Commented Nov 22, 2018 at 7:36
I posted an answer; let me know if you need any other assistance!
– Moshe Slavin, Commented Nov 22, 2018 at 10:27

Moshe Slavin · Accepted Answer · 2018-11-22 10:23:51Z


As your error shows, you have an IndexError: the inner loop runs len(date) times, so it fails whenever company or title contains fewer elements than date.

To overcome that, add a try/except around the line that raises the error.

Also, driver.current_url returns the URL as a single string, but in your inner for loop you index it as if it were a list (urlinf[i]). That would store one character of the URL instead of the whole URL, so use urlinf directly.
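
To make the two problems concrete, here is a minimal sketch that reproduces them without a browser. The lists and the URL are hypothetical stand-ins for what find_elements_by_xpath and driver.current_url would return; they are not taken from the asker's pages.

```python
# Hypothetical stand-ins for the scraped values (assumptions for illustration).
date = ["2018-11-01", "2018-11-02"]     # two matches on the page
company = ["Acme"]                      # one match fewer than date
urlinf = "https://example.com/page"     # current_url is a plain string

caught = None
try:
    for i in range(len(date)):
        company[i]                      # fails once i reaches len(company)
except IndexError as exc:
    caught = str(exc)

print(caught)       # list index out of range
print(urlinf[0])    # h  (a single character, not the URL)
```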

In your case try this:

for url in links:
    driver.get(url)
    company = driver.find_elements_by_xpath("""//*[contains(@id, 'node')]/div[2]/ul/li/div/div[1]/span""")
    date = driver.find_elements_by_xpath("""//*[contains(@id, 'node')]/div[1]/div[1]/div[2]/div/span""")
    title = driver.find_elements_by_xpath("""//*[@id="page-title"]/span""")
    urlinf = driver.current_url #url info

    num_page_items = len(date)
    for i in range(num_page_items):
        try:
            df = df.append({'Company': company[i].text, 'Date': date[i].text, 'Title': title[i].text, 'URL': urlinf}, ignore_index=True)
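        except IndexError:
            # company or title has fewer entries than date, so skip this row.
            # Note: df.append(None) would not modify df, and df.append('Null') raises a TypeError.
            pass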
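
If the goal is to get "Null" in the missing field rather than skip the row, one possible approach is to look up each field separately and fall back to a default. The sketch below is only an illustration under stated assumptions: text_or_null and build_rows are hypothetical helper names, and SimpleNamespace objects stand in for Selenium WebElements (anything with a .text attribute works). It also assumes that a missing item is missing from the end of its list, which may not hold if a page contains several items.

```python
from types import SimpleNamespace


def text_or_null(elements, i, default="Null"):
    """Return elements[i].text, or default when the list has no item at index i."""
    return elements[i].text if i < len(elements) else default


def build_rows(company, date, title, url):
    """Build one dict per item, padding any list that is too short with 'Null'."""
    n = max(len(company), len(date), len(title))
    return [
        {
            "Company": text_or_null(company, i),
            "Date": text_or_null(date, i),
            "Title": text_or_null(title, i),
            "URL": url,  # the whole URL string, not url[i]
        }
        for i in range(n)
    ]


# Stand-ins for the lists returned by find_elements_by_xpath (hypothetical data).
company = [SimpleNamespace(text="Acme")]
date = []  # the date element is missing on this page
title = [SimpleNamespace(text="Quarterly report")]

rows = build_rows(company, date, title, "https://example.com/page")
print(rows)
```

The sketch uses the longest of the three lists instead of len(date), so a page with no date still produces a row. Collecting the rows in a plain list and calling pandas.DataFrame(rows) once after the loop also avoids DataFrame.append, which was removed in pandas 2.0.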

Hope you find this helpful!


3 Comments

qbbq Over a year ago

this solution works! thank you very much - I really appreciate it.

qbbq Over a year ago

Just as a matter of interest, I tried df.append('Null') and got this error message: TypeError: cannot concatenate object of type "<type 'str'>"; only pd.Series, pd.DataFrame, and pd.Panel (deprecated) objs are valid

qbbq Over a year ago

Just an update: I decided to write directly to a CSV. With the original solution, the None / Null created a line break instead of setting the variable to "Null". As a result I added the following: blank = "blank" and except IndexError: with open('results.csv', 'a') as f: f.write(blank). However, my data in the CSV is now offset by the missing value. Would you suggest adding if statements in the loop to check whether the variable == ""?
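
On the CSV offset described in the last comment: writing a bare value with f.write puts nothing in the other columns, which is one way the data can drift. A possible alternative, sketched here with the standard library's csv.DictWriter, is to write every row as a dict and let the restval argument fill any missing column. The field names match the question; write_rows and the sample row are hypothetical, and an in-memory buffer stands in for results.csv.

```python
import csv
import io

FIELDS = ["Company", "Date", "Title", "URL"]


def write_rows(f, rows):
    """Write dict rows to f; any key absent from a row is written as 'Null'."""
    writer = csv.DictWriter(f, fieldnames=FIELDS, restval="Null")
    writer.writeheader()
    for row in rows:
        writer.writerow(row)


# Hypothetical row with no "Date" key, written to an in-memory buffer.
buf = io.StringIO()
write_rows(buf, [{"Company": "Acme", "Title": "Quarterly report", "URL": "https://example.com/page"}])
print(buf.getvalue())
```

Note that restval only covers keys that are absent from the dict; a field that is present but equal to "" would still be written as an empty string, so an explicit check would be needed for that case. When writing to a real file on Python 3, the csv documentation recommends opening it with newline=''.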
