
I'm trying to export some data to Excel. I'm a total beginner, so I apologise for any dumb questions.

I'm practising scraping on the demo site webscraper.io, and so far I have scraped the data that I want, which is the laptop names and links for the products:

import requests
from bs4 import BeautifulSoup
from pprint import pprint

url ="https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops"

r = requests.get(url)

html = r.text

soup = BeautifulSoup(html, "html.parser")

css_selector = {"class": "col-sm-4 col-lg-4 col-md-4"}

laptops = soup.find_all("div", attrs=css_selector)

for laptop in laptops:
    laptop_link = laptop.find('a')
    text = laptop_link.get_text()
    href = laptop_link['href']
    full_url = f"https://webscraper.io{href}"
    print(text)
    print(full_url)

I'm having major difficulties wrapping my head around how to export text + full_url to Excel.

I have seen code written like this:

import pandas as pd

df = pd.DataFrame(laptops)

df.to_excel("laptops_testing.xlsx", encoding="utf-8")

But when I do so, I get an .xlsx file which contains a lot of data and markup that I don't want. I just want the data that I have been printing (text and full_url).

The data I'm seeing in Excel looks like this:

<div class="thumbnail">  
<img alt="item" class="img-responsive" src="/images/test-sites/e-commerce/items/cart2.png"/> 
<div class="caption">  
<h4 class="pull-right price">$295.99</h4>  
<h4>  
<a class="title" href="/test-sites/e-commerce/allinone/product/545" title="Asus VivoBook X441NA-GA190">Asus VivoBook X4...</a>  
</h4>  
<p class="description">Asus VivoBook X441NA-GA190 Chocolate Black, 14", Celeron N3450, 4GB, 128GB SSD, Endless OS, ENG kbd</p>  
</div>

<div class="ratings">  
<p class="pull-right">14 reviews</p>  
<p data-rating="3">  
<span class="glyphicon glyphicon-star"></span>  
<span class="glyphicon glyphicon-star"></span>  
<span class="glyphicon glyphicon-star"></span>  
</p>  
</div>  
</div>

Screenshot from Google Sheets: (image omitted)

3 Answers


This is not that hard to solve. Just append the URLs and names to lists, turn the lists into a pandas DataFrame, and then write a new Excel file:

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops"

r = requests.get(url)
html = r.text
soup = BeautifulSoup(html, "html.parser")

css_selector = {"class": "col-sm-4 col-lg-4 col-md-4"}
laptops = soup.find_all("div", attrs=css_selector)

laptop_name = []
laptop_url = []
for laptop in laptops:
    laptop_link = laptop.find('a')
    text = laptop_link.get_text()
    href = laptop_link['href']
    full_url = f"https://webscraper.io{href}"
    print(text)
    # appending name of laptop
    laptop_name.append(text)
    print(full_url)
    # appending url
    laptop_url.append(full_url)

# changing it into a dataframe
new_df = pd.DataFrame({'Laptop Name': laptop_name, 'Laptop url': laptop_url})

print(new_df)

# defining the excel file
file_name = 'laptop.xlsx'
new_df.to_excel(file_name)

1 Comment

Aah yes! The append function. Totally forgot about that in my attempt to remember everything else I'm learning about Python. But it worked perfectly. Thank you!

Use the soup.select function to find elements by extended CSS selectors.

Here's a short solution:

import requests
import pandas as pd
from bs4 import BeautifulSoup

url ="https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops"
resp = requests.get(url)
soup = BeautifulSoup(resp.text, 'html.parser')
laptops = [(a.getText(), requests.compat.urljoin(url, a.get('href')))
           for a in soup.select("div.col-sm-4.col-lg-4.col-md-4 a")]
df = pd.DataFrame(laptops)
df.to_excel("laptops_testing.xlsx")

The final document would look like: (screenshot omitted)

2 Comments

Thanks a lot. This works very well. The code in the laptops variable, what is that called? I mean, I haven't seen it done like that before. Is there some kind of name I can search for, to take a deeper dive into this approach?
@The_N00b, the laptops variable is assigned with a list comprehension, which gathers a list of tuples of the form (text, url) from the found <a> tags. The list of tuples is then conveniently passed to a DataFrame.
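To see the pattern in isolation, here is a small sketch (with made-up names and hrefs, not live scraped data) that builds the same kind of tuple list with a comprehension and hands it to a DataFrame:

```python
import pandas as pd

# made-up (name, href) pairs standing in for the scraped <a> tags
links = [("Laptop A", "/product/1"), ("Laptop B", "/product/2")]

# list comprehension: one (name, full url) tuple per link, in a single pass
laptops = [(name, f"https://webscraper.io{href}") for name, href in links]

# column names can be supplied when the tuples become a DataFrame
df = pd.DataFrame(laptops, columns=["name", "url"])
print(df)
```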

Try this. Remember to import pandas. And try not to run the code too many times: you are sending a new request to the website each time.

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops"
r = requests.get(url)
html = r.text

soup = BeautifulSoup(html, "html.parser")

css_selector = {"class": "col-sm-4 col-lg-4 col-md-4"}
laptops = soup.find_all("div", attrs=css_selector)
data = []

for laptop in laptops:
    laptop_link = laptop.find('a')
    text = laptop_link.get_text()
    href = laptop_link['href']
    full_url = f"https://webscraper.io{href}"
    data.append([text, full_url])

df = pd.DataFrame(data, columns=["laptop name", "Url"])

df.to_csv("laptops.csv")

2 Comments

Damn. You are such a badass. It works very well, and I just noticed that I needed to add the price as well, but that was pretty easy to incorporate into your code. Thanks a lot! Here is what the final code looks like. I just added a few lines with prices.
html = r.text
soup = BeautifulSoup(html, "html.parser")
css_selector = {"class": "col-sm-4 col-lg-4 col-md-4"}
laptops = soup.find_all("div", attrs=css_selector)
data = []
for laptop in laptops:
    laptop_link = laptop.find('a')
    text = laptop_link.get_text()
    href = laptop_link['href']
    price = laptop.find('h4')
    laptop_price = price.get_text()
    full_url = f"https://webscraper.io{href}"
    data.append([text, full_url, laptop_price])
df = pd.DataFrame(data, columns=["laptop name", "Url", "price"])
df.to_excel("laptop_list.xlsx")
