I am trying to scrape some data from a website and save it to a CSV file. When I get the scraped data, there is a huge amount of whitespace between each line. I want to remove this unnecessary whitespace. Below is my code:

from bs4 import BeautifulSoup
import requests
import csv

#URL to be scraped
url_to_scrape = 'https://www.sainsburys.co.uk/shop/gb/groceries/meat-fish/CategoryDisplay?langId=44&storeId=10151&catalogId=10241&categoryId=310864&orderBy=FAVOURITES_ONLY%7CSEQUENCING%7CTOP_SELLERS&beginIndex=0&promotionId=&listId=&searchTerm=&hasPreviousOrder=&previousOrderId=&categoryFacetId1=&categoryFacetId2=&ImportedProductsCount=&ImportedStoreName=&ImportedSupermarket=&bundleId=&parent_category_rn=13343&top_category=13343&pageSize=120#langId=44&storeId=10151&catalogId=10241&categoryId=310864&parent_category_rn=13343&top_category=13343&pageSize=120&orderBy=FAVOURITES_ONLY%7CSEQUENCING%7CTOP_SELLERS&searchTerm=&beginIndex=0&hideFilters=true'
#Load html's plain data into a variable
plain_html_text = requests.get(url_to_scrape)
#parse the data
soup = BeautifulSoup(plain_html_text.text, "lxml")
#
# #Get the name of the class

csv_file = open('sainsburys.csv', 'w')

csv_writer = csv.writer(csv_file)
csv_writer.writerow(['Description','Price'])

for name_of in soup.find_all('li',class_='gridItem'):
    name = name_of.h3.a.text
    print(name)
    try:
        price = name_of.find('div', class_='product')
        pricen = price.find('div', class_='addToTrolleytabBox').p.text
        print(pricen)
        csv_writer.writerow([name, pricen])
    except:
        print('Sold Out')
        print()

csv_writer.writerow([name, pricen])
csv_file.close()

The result I get is this:

                                       J. James Chicken Goujons 270g



        £1.25/unit


                                        Sainsbury's Chicken Whole Bird (approx. 0.9-1.35kg)



        £1.90/kg


                                        Sainsbury's British Fresh Chicken Fajita Mini Fillets 320g



        £2.55/unit


                                        Sainsbury's Slow Cook Fire Cracker Chicken 573g



        £4.75/unit

Thank you

3 Answers

If you log your network traffic and filter it to show only XHR resources, you'll find a request to an AJAX endpoint. The server responds with JSON, but unfortunately the product data inside it is HTML embedded in the JSON rather than pure JSON. None of this is strictly required, since your code already scrapes the page fine, but it is a neater way of getting the products, and you don't have to worry about things like pagination.

To strip the leading and trailing whitespace, as others have already pointed out, use str.strip. In this example I'm only printing the first ten products (out of 114). And yes, I realize I could have appended the query string to the URL rather than creating a params dict, but it's easier to read and change this way:

import requests
from bs4 import BeautifulSoup


class Product:
    """Name, weight and prices parsed from one product's HTML fragment."""

    def __init__(self, html):
        soup = BeautifulSoup(html, "html.parser")
        # The last space-separated token of the link text is taken as the weight.
        self.name, _, self.weight = soup.find("a").text.strip().rpartition(" ")
        self.price_per_unit = soup.find("p", {"class": "pricePerUnit"}).text.strip()
        self.price_per_measure = soup.find("p", {"class": "pricePerMeasure"}).text.strip()

    def __str__(self):
        return f"\"{self.name}\" ({self.weight}) - {self.price_per_unit}"

url = "https://www.sainsburys.co.uk/webapp/wcs/stores/servlet/AjaxApplyFilterBrowseView"

params = {
    "langId": "44",
    "storeId": "10151",
    "catalogId": "10241",
    "categoryId": "310864",
    "parent_category_rn": "13343",
    "top_category": "13343",
    "pageSize": "120",
    "orderBy": "FAVOURITES_ONLY|SEQUENCING|TOP_SELLERS",
    "searchTerm": "",
    "beginIndex": "0",
    "hideFilters": "true",
    "requesttype": "ajax"
}

response = requests.get(url, params=params)
response.raise_for_status()

product_info = response.json()[4]["productLists"][0]["products"]

products = [Product(p["result"]) for p in product_info[:10]]

for product in products:
    print(product)

Output:

"Sainsbury's Chicken Thigh Fillets" (640g) - £3.40/unit
"Sainsbury's Mini Chicken Breast Fillets" (320g) - £2.00/unit
"Sainsbury's Chicken Thighs" (1kg) - £1.95/unit
"Sainsbury's Chicken Breast Fillets" (300g) - £1.70/unit
"Sainsbury's Chicken Drumsticks" (1kg) - £1.70/unit
"Sainsbury's Chicken Thigh Fillets" (320g) - £1.85/unit
"Sainsbury's Chicken Breast Diced" (410g) - £2.40/unit
"Sainsbury's Chicken Small Whole Bird" (1.35kg) - £2.80/unit
"Sainsbury's Chicken Thighs & Drumsticks" (540g) - £1.00/unit
"Sainsbury's Chicken Breast Fillets" (640g) - £3.60/unit
>>> product.price_per_measure
'£5.63/kg'
>>> 
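
For reference, the params dict above ends up as an ordinary query string. Here is a minimal sketch using the standard library's urllib.parse.urlencode to show roughly what gets appended to the URL. The dict is a trimmed copy for illustration, and the assumption is that requests encodes params= in an equivalent way; the exact URL that requests builds is not reproduced here.

```python
from urllib.parse import urlencode

url = "https://www.sainsburys.co.uk/webapp/wcs/stores/servlet/AjaxApplyFilterBrowseView"

# A trimmed copy of the params dict from the answer (illustration only).
params = {
    "categoryId": "310864",
    "pageSize": "120",
    "orderBy": "FAVOURITES_ONLY|SEQUENCING|TOP_SELLERS",
    "beginIndex": "0",
}

# urlencode percent-encodes reserved characters, so "|" becomes "%7C".
query_string = urlencode(params)
full_url = f"{url}?{query_string}"

print(full_url)
```

This is why the long URL in the question contains %7C: it is just the encoded form of the "|" separators in orderBy.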

3 Comments

Hey, this seems to work better, but I am not sure how I can get the price as well now. I want it to show the price of each product too. Thank you!
@RamgithUnniJagajith I've updated the code in my answer, take a look. I've basically just created a Product class that extracts the data it's interested in from the HTML. Every product has a name, weight, price_per_unit and price_per_measure.
Perfect! Thanks a lot

You could use .strip(). It removes leading and trailing whitespace, including newlines:

>>> s = "      I'm a sentence     "
>>> s.strip()
"I'm a sentence"

Applied to your problem:

for name_of in soup.find_all('li',class_='gridItem'):
    # Strip the padding around the product name
    name = name_of.h3.a.text.strip()
    print(name)
    try:
        price = name_of.find('div', class_='product')
        pricen = price.find('div', class_='addToTrolleytabBox').p.text.strip()
        print(pricen)
        csv_writer.writerow([name, pricen])
    except AttributeError:
        # find() returned None somewhere, so no price is shown for this item
        print('Sold Out')
        print()

csv_file.close()

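I can't test this without being able to run your code, but it should work if pricen and name are strings with significant leading and trailing whitespace. I also dropped the extra writerow call after the loop, which would have written the last row a second time.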


I hope this helps!
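
As a rough, self-contained sketch of the whole clean-then-write step: the strings below are hypothetical stand-ins shaped like the padded output in the question, clean() is a helper name made up for this example, and io.StringIO stands in for the real file so the snippet runs anywhere. For a file on disk, the csv documentation recommends opening it with newline=''.

```python
import csv
import io

# Hypothetical raw strings, shaped like the padded text in the question's output.
scraped = [
    ("\n      J. James Chicken Goujons 270g\n\n", "\n\n    £1.25/unit\n"),
    ("\n      Sainsbury's Slow Cook Fire Cracker Chicken 573g\n", "\n  £4.75/unit\n\n"),
]


def clean(text):
    """Strip both ends and collapse internal runs of whitespace to one space."""
    return " ".join(text.split())


# io.StringIO stands in for a real file here; on disk you would use
# open('sainsburys.csv', 'w', newline='') inside a with block.
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(["Description", "Price"])
for raw_name, raw_price in scraped:
    writer.writerow([clean(raw_name), clean(raw_price)])

print(buffer.getvalue())
```

Using split() and join() goes slightly further than strip(): it also squeezes any newlines or repeated spaces inside the text, which can matter if a product name is wrapped across lines in the HTML.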


str.strip() will strip all the whitespace characters from both sides:

>>> "              a             ".strip()
'a'

Apply this to each extracted string before you print it or write it to the CSV file.
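
A short sketch of the related string methods, in case only one side needs trimming. The raw value is a made-up example that resembles the padded price text from the question.

```python
# Made-up example resembling the padded price text in the question.
raw = "\n\t   £1.25/unit  \n\n"

both = raw.strip()     # removes whitespace (spaces, tabs, newlines) on both ends
left = raw.lstrip()    # removes leading whitespace only
right = raw.rstrip()   # removes trailing whitespace only

# strip() also accepts a set of characters to remove instead of whitespace.
dashes = "--a--".strip("-")

print(repr(both))
print(repr(left))
print(repr(right))
print(repr(dashes))
```

Note that none of these touch whitespace in the middle of a string; they only trim the ends.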
