
I'm trying to export some data to Excel. I'm a total beginner, so I apologise for any dumb questions.

I'm practising scraping on the demo site webscraper.io, and so far I have scraped the data that I want, which is the laptop names and links for the products:

import requests
from bs4 import BeautifulSoup
from pprint import pprint

url ="https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops"

r = requests.get(url)

html = r.text

soup = BeautifulSoup(html, "html.parser")

css_selector = {"class": "col-sm-4 col-lg-4 col-md-4"}

laptops = soup.find_all("div", attrs=css_selector)

for laptop in laptops:
    laptop_link = laptop.find('a')
    text = laptop_link.get_text()
    href = laptop_link['href']
    full_url = f"https://webscraper.io{href}"
    print(text)
    print(full_url)

I'm having major difficulties wrapping my head around how to export text + full_url to Excel.

I have seen code written like this:

import pandas as pd

df = pd.DataFrame(laptops)

df.to_excel("laptops_testing.xlsx", encoding="utf-8")

But when I do so, I get an .xlsx file which contains a lot of data and markup that I don't want. I just want the data that I have been printing (text and full_url).

The data I'm seeing in Excel looks like this:

<div class="thumbnail">  
<img alt="item" class="img-responsive" src="/images/test-sites/e-commerce/items/cart2.png"/> 
<div class="caption">  
<h4 class="pull-right price">$295.99</h4>  
<h4>  
<a class="title" href="/test-sites/e-commerce/allinone/product/545" title="Asus VivoBook X441NA-GA190">Asus VivoBook X4...</a>  
</h4>  
<p class="description">Asus VivoBook X441NA-GA190 Chocolate Black, 14", Celeron N3450, 4GB, 128GB SSD, Endless OS, ENG kbd</p>  
</div>

<div class="ratings">  
<p class="pull-right">14 reviews</p>  
<p data-rating="3">  
<span class="glyphicon glyphicon-star"></span>  
<span class="glyphicon glyphicon-star"></span>  
<span class="glyphicon glyphicon-star"></span>  
</p>  
</div>  
</div>

Screenshot from Google Sheets: (image omitted)

3 Answers


This is not that hard to solve. Just append the URLs and names to lists, turn the lists into a pandas DataFrame, and then write a new Excel file:

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops"

r = requests.get(url)
html = r.text
soup = BeautifulSoup(html, "html.parser")

css_selector = {"class": "col-sm-4 col-lg-4 col-md-4"}
laptops = soup.find_all("div", attrs=css_selector)

laptop_name = []
laptop_url = []
for laptop in laptops:
    laptop_link = laptop.find('a')
    text = laptop_link.get_text()
    href = laptop_link['href']
    full_url = f"https://webscraper.io{href}"
    print(text)
    # appending name of laptop
    laptop_name.append(text)
    print(full_url)
    # appending url
    laptop_url.append(full_url)

# changing it into a dataframe
new_df = pd.DataFrame({'Laptop Name': laptop_name, 'Laptop url': laptop_url})

print(new_df)

# defining the excel file
file_name = 'laptop.xlsx'
new_df.to_excel(file_name)

1 Comment

Aah yes! The append function. Totally forgot about that in my attempt to remember everything else I'm learning about Python. But it worked perfectly. Thank you!

Use the soup.select function to find elements by extended CSS selectors.

Here's a short solution:

import requests
import pandas as pd
from bs4 import BeautifulSoup

url ="https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops"
resp = requests.get(url)
soup = BeautifulSoup(resp.text, 'html.parser')
laptops = [(a.getText(), requests.compat.urljoin(url, a.get('href')))
           for a in soup.select("div.col-sm-4.col-lg-4.col-md-4 a")]
df = pd.DataFrame(laptops)
df.to_excel("laptops_testing.xlsx")

The final document would look like: (screenshot omitted)

2 Comments

Thanks a lot. This works very well. The code in the laptops variable, what is that called? I mean, I haven't seen it done like that before. Is there some kind of name I can search for, to take a deeper dive into this approach?
@The_N00b, the laptops variable is assigned with a list comprehension, which gathers a list of tuples of the form (text, url) from the found <a> tags. The list of tuples is then conveniently passed to a DataFrame.
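To see the pattern in isolation, here is a small sketch (with made-up names and hrefs, not live scraped data) that builds the same kind of tuple list with a comprehension and hands it to a DataFrame:

```python
import pandas as pd

# made-up (name, href) pairs standing in for the scraped <a> tags
links = [("Laptop A", "/product/1"), ("Laptop B", "/product/2")]

# list comprehension: one (name, full url) tuple per link, in a single pass
laptops = [(name, f"https://webscraper.io{href}") for name, href in links]

# column names can be supplied when the tuples become a DataFrame
df = pd.DataFrame(laptops, columns=["name", "url"])
print(df)
```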

Try this. Remember to import pandas. And try not to run the code too many times: you are sending a new request to the website each time.

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops"
r = requests.get(url)
html = r.text

soup = BeautifulSoup(html, "html.parser")

css_selector = {"class": "col-sm-4 col-lg-4 col-md-4"}
laptops = soup.find_all("div", attrs=css_selector)
data = []

for laptop in laptops:
    laptop_link = laptop.find('a')
    text = laptop_link.get_text()
    href = laptop_link['href']
    full_url = f"https://webscraper.io{href}"
    data.append([text, full_url])

df = pd.DataFrame(data, columns=["laptop name", "Url"])

df.to_csv("laptops.csv")

2 Comments

Damn. You are such a badass. It works very well, and I just noticed that I needed to add the price as well, but that was pretty easy to incorporate into your code. Thanks a lot! Here is what the final code looks like. I just added a few lines with prices.
html = r.text
soup = BeautifulSoup(html, "html.parser")
css_selector = {"class": "col-sm-4 col-lg-4 col-md-4"}
laptops = soup.find_all("div", attrs=css_selector)
data = []
for laptop in laptops:
    laptop_link = laptop.find('a')
    text = laptop_link.get_text()
    href = laptop_link['href']
    price = laptop.find('h4')
    laptop_price = price.get_text()
    full_url = f"https://webscraper.io{href}"
    data.append([text, full_url, laptop_price])
df = pd.DataFrame(data, columns=["laptop name", "Url", "price"])
df.to_excel("laptop_list.xlsx")
