0

I have written a script to scrape product information from websites. The goal is to write this information out to an Excel file. Due to my limited Python knowledge, the only way I know to export it is with Out-File in PowerShell. The result is that the information for each product is printed on separate lines, but I would prefer one line per product.

My desired output can be seen in the picture. I would prefer my output to look like the second version, but I can live with the first one.

enter image description here


Here is my code:

from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException    

url = "http://www.strem.com/"
cas = ['16940-92-4','29796-57-4','13569-57-8','15635-87-7']

for i in cas:
    driver = webdriver.Firefox()
    driver.get(url)

    driver.find_element_by_id("selectbox_input").click()
    driver.find_element_by_id("selectbox_input_cas").click()

    inputElement = driver.find_element_by_name("keyword")
    inputElement.send_keys(i)
    inputElement.submit()

    # Check if a particular element exists; returns True/False          
    def check_exists_by_xpath(xpath):
        try:
            driver.find_element_by_xpath(xpath)
        except NoSuchElementException:
            return False
        return True

    xpath1 = ".//div[@class = 'error']" # element containing error message
    xpath2 = ".//table[@class = 'product_list tiles']" # element containing table to select product from
    #xpath3 = ".//div[@class = 'catalog_number']" # when selection is needed, returns the first catalog number

    if check_exists_by_xpath(xpath1):
        print("cas# %s is not found on Strem." % i)
        driver.quit() 
    else:
        if check_exists_by_xpath(xpath2):
            catNum = driver.find_element_by_xpath(".//div[@class = 'catalog_number']")
            catNum.click()

            country = driver.find_element_by_name("country")
            for option in country.find_elements_by_tag_name('option'):
                if option.text == "USA":
                    option.click()
            country.submit()

            name = driver.find_element_by_id("header_description").text
            prodNum = driver.find_element_by_class_name("catalog_number").text
            print(i)
            print(name.encode("utf-8"))
            print(prodNum)

            skus_by_xpath = WebDriverWait(driver, 10).until(
                lambda driver : driver.find_elements_by_xpath(".//td[@class='size']")
            )

            for output in skus_by_xpath:
                print(output.text)

            prices_by_xpath = WebDriverWait(driver, 10).until(
                lambda driver : driver.find_elements_by_xpath(".//td[@class='price']")
            )

            for result in prices_by_xpath:
                print(result.text[3:])  # drops the first three characters; use [:-3] to drop the last three instead

            driver.quit()
        else:
            country = driver.find_element_by_name("country")
            for option in country.find_elements_by_tag_name('option'):
                if option.text == "USA":
                    option.click()
            country.submit()

            name = driver.find_element_by_id("header_description").text
            prodNum = driver.find_element_by_class_name("catalog_number").text
            print(i)
            print(name.encode("utf-8"))
            print(prodNum)

            skus_by_xpath = WebDriverWait(driver, 10).until(
                lambda driver : driver.find_elements_by_xpath(".//td[@class='size']")
            )

            for output in skus_by_xpath:
                print(output.text)

            prices_by_xpath = WebDriverWait(driver, 10).until(
                lambda driver : driver.find_elements_by_xpath(".//td[@class='price']")
            )

            for result in prices_by_xpath:
                print(result.text[3:])  # drops the first three characters; use [:-3] to drop the last three instead

            driver.quit()

2 Answers

1

https://pythonhosted.org/openpyxl/tutorial.html

This is a tutorial for openpyxl, a Python library for reading and writing Excel files. There are other libraries, but I like using this one.

from openpyxl import Workbook
wb = Workbook()

Then use the methods it provides to write your data, and finally save the workbook:

wb.save(filename)

It is really easy to get started.
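To make the "one line per product" part concrete, here is a minimal sketch, not a definitive implementation. The helper names `product_row` and `write_products_xlsx` are hypothetical, and the column order (CAS number, name, catalog number, then alternating size and price) is an assumption about the desired layout. It relies only on `Workbook()`, `wb.active`, `ws.append()` and `wb.save()` from openpyxl.

```python
def product_row(cas, name, prod_num, sizes, prices):
    """Flatten one product into a single row (hypothetical helper).

    Fixed fields come first, followed by alternating size/price values.
    """
    row = [cas, name, prod_num]
    for size, price in zip(sizes, prices):
        row.extend([size, price])
    return row


def write_products_xlsx(rows, filename):
    """Write each row on its own line of a new worksheet (hypothetical helper)."""
    # Imported here so product_row() stays usable without openpyxl installed.
    from openpyxl import Workbook

    wb = Workbook()
    ws = wb.active  # the worksheet created together with the workbook
    for row in rows:
        ws.append(row)  # append() adds the values as one new row
    wb.save(filename)
```

In the scraping loop you would collect the text of the size and price cells into two lists, call `product_row(i, name, prodNum, sizes, prices)` for each CAS number, and call `write_products_xlsx(rows, "products.xlsx")` once after the loop.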

This is a PDF tutorial for using xlwt and xlrd, but I don't really use these modules a lot: http://www.simplistix.co.uk/presentations/python-excel.pdf


0

I usually find that writing to CSV is the safest way to get data into Excel. I use something like the following code:

import csv
import sys
import time
import datetime
from os import fsync

ts = time.time()  # get the current time, to use in the filename
ds = datetime.datetime.fromtimestamp(ts).strftime('%Y%m%d%H%M')  # format the time for the filename
f2 = open('OutputLog_' + ds + '.txt', 'w')  # the file name is OutputLog_ plus the date-time stamp
f2.write('Column1DataPoint' + ',' + 'Column2DataPoint' + '\n')  # write one line; separate the data points with commas
# if you're running a long loop and want to keep the file up to date with the process, do these two steps inside the loop too
f2.flush()
fsync(f2.fileno())

# once the loop is finished and the data is written, close the file
f2.close()

For your script, I think you would change the write line to something like the following:

f2.write(str(i+','+name.encode("utf-8")+','+prodNum+','+output.text)+'\n')
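Joining fields by hand with ',' breaks as soon as a product name itself contains a comma. As a hedged alternative, the sketch below uses the standard library's `csv.writer`, which quotes such fields automatically. It assumes Python 3 (the `newline=""` and `encoding` arguments to `open`), and the function name `write_products_csv` is hypothetical.

```python
import csv


def write_products_csv(rows, filename):
    """Write one CSV line per product row (hypothetical helper).

    Each row is a list of strings, e.g. [cas, name, prod_num, size, price].
    """
    # newline="" lets the csv module control line endings itself.
    with open(filename, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        for row in rows:
            writer.writerow(row)  # fields containing commas are quoted
```

Excel opens the resulting file directly when it has a `.csv` extension, with one product per line.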
