Removing spaces between scraped data - Python

Question

I am trying to scrape some data from a website and save it to a CSV file. When I get the scraped data, there is a huge amount of whitespace around each line. I want to remove this unnecessary whitespace. My code is below:

from bs4 import BeautifulSoup
import requests
import csv

#URL to be scraped
url_to_scrape = 'https://www.sainsburys.co.uk/shop/gb/groceries/meat-fish/CategoryDisplay?langId=44&storeId=10151&catalogId=10241&categoryId=310864&orderBy=FAVOURITES_ONLY%7CSEQUENCING%7CTOP_SELLERS&beginIndex=0&promotionId=&listId=&searchTerm=&hasPreviousOrder=&previousOrderId=&categoryFacetId1=&categoryFacetId2=&ImportedProductsCount=&ImportedStoreName=&ImportedSupermarket=&bundleId=&parent_category_rn=13343&top_category=13343&pageSize=120#langId=44&storeId=10151&catalogId=10241&categoryId=310864&parent_category_rn=13343&top_category=13343&pageSize=120&orderBy=FAVOURITES_ONLY%7CSEQUENCING%7CTOP_SELLERS&searchTerm=&beginIndex=0&hideFilters=true'
#Load html's plain data into a variable
plain_html_text = requests.get(url_to_scrape)
#parse the data
soup = BeautifulSoup(plain_html_text.text, "lxml")
#
# #Get the name of the class

csv_file = open('sainsburys.csv', 'w')

csv_writer = csv.writer(csv_file)
csv_writer.writerow(['Description','Price'])

for name_of in soup.find_all('li',class_='gridItem'):
    name = name_of.h3.a.text
    print(name)
    try:
        price = name_of.find('div', class_='product')
        pricen = price.find('div', class_='addToTrolleytabBox').p.text
        print(pricen)
        csv_writer.writerow([name, pricen])
    except:
        print('Sold Out')
        print()

csv_writer.writerow([name, pricen])
csv_file.close()

The results I get look like this:

                                       J. James Chicken Goujons 270g



        £1.25/unit


                                        Sainsbury's Chicken Whole Bird (approx. 0.9-1.35kg)



        £1.90/kg


                                        Sainsbury's British Fresh Chicken Fajita Mini Fillets 320g



        £2.55/unit


                                        Sainsbury's Slow Cook Fire Cracker Chicken 573g



        £4.75/unit

Thank you

Paul M. · Accepted Answer · 2020-04-13 20:03:14Z

2

If you log your network traffic and filter it to show only XHR resources, you'll find a request to an AJAX endpoint. The server responds with JSON, but unfortunately the product data inside it is HTML embedded in the JSON response rather than plain JSON fields. None of this is strictly required, since your code already scrapes the page fine, but it is a neater way of getting the products, and you don't have to worry about things like pagination.

To strip the leading and trailing whitespace, as others have already pointed out, use str.strip. In this example I'm only printing the first ten products (out of 114). I could have appended the query string to the URL rather than creating a params dict, but a dict is easier to read and change:

import requests
from bs4 import BeautifulSoup


class Product:

    def __init__(self, html):
        soup = BeautifulSoup(html, "html.parser")
        # The link text ends with the weight, e.g. "... Fillets 640g",
        # so split it off at the last space.
        self.name, _, self.weight = soup.find("a").text.strip().rpartition(" ")
        self.price_per_unit = soup.find("p", {"class": "pricePerUnit"}).text.strip()
        self.price_per_measure = soup.find("p", {"class": "pricePerMeasure"}).text.strip()

    def __str__(self):
        return f"\"{self.name}\" ({self.weight}) - {self.price_per_unit}"

url = "https://www.sainsburys.co.uk/webapp/wcs/stores/servlet/AjaxApplyFilterBrowseView"

params = {
    "langId": "44",
    "storeId": "10151",
    "catalogId": "10241",
    "categoryId": "310864",
    "parent_category_rn": "13343",
    "top_category": "13343",
    "pageSize": "120",
    "orderBy": "FAVOURITES_ONLY|SEQUENCING|TOP_SELLERS",
    "searchTerm": "",
    "beginIndex": "0",
    "hideFilters": "true",
    "requesttype": "ajax"
}

response = requests.get(url, params=params)
response.raise_for_status()

product_info = response.json()[4]["productLists"][0]["products"]

products = [Product(p["result"]) for p in product_info[:10]]

for product in products:
    print(product)

Output:

"Sainsbury's Chicken Thigh Fillets" (640g) - £3.40/unit
"Sainsbury's Mini Chicken Breast Fillets" (320g) - £2.00/unit
"Sainsbury's Chicken Thighs" (1kg) - £1.95/unit
"Sainsbury's Chicken Breast Fillets" (300g) - £1.70/unit
"Sainsbury's Chicken Drumsticks" (1kg) - £1.70/unit
"Sainsbury's Chicken Thigh Fillets" (320g) - £1.85/unit
"Sainsbury's Chicken Breast Diced" (410g) - £2.40/unit
"Sainsbury's Chicken Small Whole Bird" (1.35kg) - £2.80/unit
"Sainsbury's Chicken Thighs & Drumsticks" (540g) - £1.00/unit
"Sainsbury's Chicken Breast Fillets" (640g) - £3.60/unit
>>> product.price_per_measure
'£5.63/kg'
>>>
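
The remark above about appending the query string to the URL instead of passing a params dict can be illustrated with the standard library alone. This is a minimal sketch, not the answer's code: `build_url` is a hypothetical helper, only a subset of the params is used, and requests does its own encoding when given `params=`, so the exact string it sends may differ in detail.

```python
from urllib.parse import urlencode

url = "https://www.sainsburys.co.uk/webapp/wcs/stores/servlet/AjaxApplyFilterBrowseView"

# A subset of the params from the answer, kept short for illustration.
params = {
    "categoryId": "310864",
    "pageSize": "120",
    "orderBy": "FAVOURITES_ONLY|SEQUENCING|TOP_SELLERS",
    "searchTerm": "",
    "requesttype": "ajax",
}


def build_url(base, query):
    """Hypothetical helper: join a base URL and a percent-encoded query string."""
    return f"{base}?{urlencode(query)}"


full_url = build_url(url, params)
print(full_url)
```

The `|` separators in `orderBy` come out percent-encoded as `%7C`, which is the same form that appears in the long URL in the question. Keeping the values in a dict avoids writing that encoding by hand.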

edited Apr 13, 2020 at 20:03

answered Apr 13, 2020 at 18:20

Paul M.


3 Comments

Ambilli Radhakrishnan Over a year ago

Hey, this seems to work better, but I am not sure how I can get the price as well now. I want it to show the price of each product too. Thank you!

Paul M. Over a year ago

@RamgithUnniJagajith I've updated the code in my answer, take a look. I've basically just created a Product class that extracts the data it's interested in from the HTML. Every product has a name, weight, price_per_unit and price_per_measure.

Ambilli Radhakrishnan Over a year ago

Perfect! Thanks a lot

Darien Schettler · Accepted Answer · 2020-04-13 17:56:31Z

You could use .strip(), which removes leading and trailing whitespace:

>>> s = "      I'm a sentence     "
>>> s.strip()
"I'm a sentence"

Applied to your problem:

for name_of in soup.find_all('li', class_='gridItem'):
    name = name_of.h3.a.text.strip()
    print(name)
    try:
        price = name_of.find('div', class_='product')
        pricen = price.find('div', class_='addToTrolleytabBox').p.text.strip()
        print(pricen)
        csv_writer.writerow([name, pricen])
    except AttributeError:
        # find() returned None or there is no <p>, so no price is shown
        print('Sold Out')
        print()

# Each row is already written inside the loop, so no extra writerow here
csv_file.close()

Without being able to run your code I can't test this, but it should work if pricen and name are strings with significant leading and trailing whitespace.

I hope this helps!
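
As a rough, self-contained illustration of the same idea, the sketch below strips each field just before it is written to the CSV. The `scraped` values are made-up stand-ins for what the page returns, `write_rows` is a hypothetical helper, and an in-memory buffer replaces the real file so the snippet runs without network access.

```python
import csv
import io

# Hypothetical scraped values, padded the way the question's output is.
scraped = [
    ("\n      J. James Chicken Goujons 270g\n\n", "\n    £1.25/unit\n"),
    ("\n      Sainsbury's Slow Cook Fire Cracker Chicken 573g\n", "\n    £4.75/unit\n"),
]


def write_rows(rows, handle):
    """Strip each field and write one CSV row per product."""
    writer = csv.writer(handle)
    writer.writerow(["Description", "Price"])
    for name, price in rows:
        writer.writerow([name.strip(), price.strip()])


# An in-memory buffer stands in for the file; with a real file you would use
# open("sainsburys.csv", "w", newline="", encoding="utf-8") in a with-block.
buffer = io.StringIO()
write_rows(scraped, buffer)
print(buffer.getvalue())
```

Opening a real file with `newline=""` is what the csv module documentation recommends; without it, some platforms write an extra blank line between rows, which is a separate source of "space between lines" in the output file.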

user6597761 · Accepted Answer · 2020-04-13 17:56:46Z

0

str.strip() will strip all the whitespace characters from both sides:

>>> "              a             ".strip()
'a'

Just apply this to each string before you print it or write it to the CSV.
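
A small sketch of what "whitespace characters" covers here, using a made-up string rather than real scraped data. It also shows one common way to collapse whitespace inside the string, which strip() alone does not touch.

```python
# Made-up example with leading/trailing newlines and tabs, plus inner gaps.
raw = "\n\t   Sainsbury's   Chicken \n  Thighs 1kg  \n\n"

# strip() with no argument removes spaces, tabs and newlines from both ends only.
stripped = raw.strip()

# split() with no argument splits on any run of whitespace, so re-joining
# with a single space also collapses the whitespace inside the string.
collapsed = " ".join(raw.split())

print(repr(stripped))
print(repr(collapsed))
```

If the scraped text only has padding at the ends, strip() is enough; the split/join form is worth considering when the markup also leaves line breaks in the middle of a product name.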

