I know similar questions have been posted before, but I haven't found an answer that works for this case. I hope you can help.

Here is a summary of the issue:

  1. I am writing web scraping code using Selenium (for an assignment).
  2. The code uses a for-loop to go from one page to the next.
  3. The output for each page is a dataframe (basically a table) that is exported to Excel.
  4. The dataframes from all the web pages should be captured in a single Excel sheet (not multiple sheets within the Excel file).
  5. Each web page has the same data format (i.e. the number of columns and the column headers are the same, but the row values vary).
  6. For info, I am using pandas because it helps convert the output from the website to Excel.

The problem I'm facing is that each time the dataframe is exported to Excel, it overwrites the data from the previous iteration. Hence, when I run the code and the scraping completes, I only get the data from the last for-loop iteration.

Please advise which line(s) of code I need to add so that all the iterations are captured in the Excel sheet. More specifically, each iteration should export its data to Excel starting from the first empty row.

Here is an extract from the code:

for i in range(50, 60):
    url = urlA + str(i)  # URL generator; urlA is the main link excluding pagination
    driver.get(url)
    time.sleep(random.randint(3, 7))
    text = driver.find_element_by_xpath('/html/body/pre').text
    data = pd.DataFrame(eval(text))
    export_excel = data.to_excel(xlpath)
  • Create only one dataframe before the for-loop, append each page's data to it inside the for-loop, and save it only once after the for-loop. Commented Oct 8, 2019 at 23:58

1 Answer

Thanks Dijkgraaf. Your proposal worked.

Here is the full code for others (for future reference).

Apologies for the formatting; I couldn't set it properly. Anyway, I hope the code below is of some use to someone in the future.

import pandas as pd
from selenium.webdriver.common.by import By

xlpath = "c:/projects/excelfile.xlsx"
urlA = "https://www.your-website.com/"  # placeholder: the main link, excluding the page number

df = pd.DataFrame()  # empty dataframe created before the for-loop starts

# driver is the Selenium WebDriver created earlier in the script
for i in range(1, 10):
    url = urlA + str(i)  # URL generator for pagination (loops through the pages)
    driver.get(url)
    # gets the text from the site (older Selenium versions: find_element_by_xpath)
    text = driver.find_element(By.XPATH, '/html/body/pre').text
    data = pd.DataFrame(eval(text))  # evaluates the extracted text and converts it to a pandas dataframe
    # adds the new page (data) to df; pd.concat replaces df.append, which pandas 2.0 removed
    df = pd.concat([df, data])

df.to_excel(xlpath)  # exports the consolidated dataframe (df) to Excel once, after the loop
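
As a variation on the same idea, here is a minimal sketch that collects the per-page dataframes in a list and concatenates them once at the end, which avoids copying the growing dataframe on every iteration. It also swaps eval for ast.literal_eval, which only accepts Python literals and is therefore a safer choice for scraped text; this assumes the page text is a Python-style list of dicts (if the site serves JSON, json.loads would be the analogous call). The names combine_pages and sample_pages are hypothetical, and the sample strings stand in for the text Selenium would return.

```python
import ast

import pandas as pd


def combine_pages(page_texts):
    """Parse each page's text into a DataFrame and stack them into one."""
    frames = []
    for text in page_texts:
        # literal_eval parses Python literals only (lists, dicts, numbers,
        # strings), so arbitrary code in the scraped text is never executed.
        records = ast.literal_eval(text)
        frames.append(pd.DataFrame(records))
    # A single concat at the end; ignore_index renumbers the rows 0..n-1.
    return pd.concat(frames, ignore_index=True)


# Hypothetical page contents standing in for the scraped text of each page.
sample_pages = [
    "[{'name': 'a', 'value': 1}, {'name': 'b', 'value': 2}]",
    "[{'name': 'c', 'value': 3}]",
]

combined = combine_pages(sample_pages)
print(combined)
```

In the scraper itself, the list would be filled with the text from each page inside the for-loop, and combined.to_excel(xlpath, index=False) would be called once after the loop.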