Removing specific strings from Python Webscraping Results

Question

I'm new to web scraping and am currently trying out this block of code:

import requests
import bs4
from bs4 import BeautifulSoup
import pandas as pd
import time

page = requests.get("https://leeweebrothers.com/our-food/lunch-boxes/#")
soup = BeautifulSoup(page.text, "html.parser")

names = soup.find_all('h2') #name of food
rest = soup.find_all('span', {'class' : 'amount'}) # price of food

for div, a in zip(names, rest):
    print(div.text, a.text) # print name / price in same line

It works great except for one problem, which is shown in the linked question "printing result of 2 for loops in same line".

Next to the string "HONEY GLAZED CHICKEN WING" is a $0.00, an outlier that comes from the shopping cart app on the website (it shares the span class='amount').

How would I remove this string and "move up" the other prices so that they line up with the names of the food?

Edit: Sample output below

 Line1: HONEY GLAZED CHICKEN WING $0.00
 Line2: CRISPY CHICKEN LUNCH BOX
 Line3:                                                    $5.00
 Line4: BREADED FISH LUNCH BOX
 Line5:                                                    $5.00

My desired output would be something like:

 Line1: HONEY GLAZED CHICKEN WING                          $5.00
 Line2: CRISPY CHICKEN LUNCH BOX                           $5.00

I'm looking for a solution that removes the outlying $0.00 and moves the rest of the prices up.

Please paste a short and representative sample of your current output, as well as your intended output. Otherwise no one will get what you want.
– sudonym, Commented Jun 7, 2018 at 4:37

Jay Calamari · Accepted Answer · 2018-06-07 04:53:27Z

1

I think you might have asked the wrong question. You can eliminate the $0.00 outlier, but your results for the prices still won't match up with the names.

To be sure that your lists of prices and names are in the same order, so that they match up, it might be easier to first search for the divs that contain both of them:

import requests
import bs4
from bs4 import BeautifulSoup
import pandas as pd
import time

page = requests.get("https://leeweebrothers.com/our-food/lunch-boxes/#")
soup = BeautifulSoup(page.text, "html.parser")

# all the divs that held the foods had this same style
divs = soup.find_all('div', {'style': 'max-height:580px;'})
names_and_prices = {
    # name: price
    div.find('h2').text: div.find('span', {'class': 'amount'}).text
    for div in divs
}
for name, price in names_and_prices.items():
    print(name, price)
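The same idea can be tried offline without hitting the site. The sketch below is only an illustration: the HTML snippet, the "cart" and "product" class names, and the helper pair_names_and_prices are made up for the example and are not taken from the real page. It shows that looking up the name and price inside each container keeps them paired, and that the cart's $0.00 never enters the results.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page:
# one shopping-cart widget plus two product cards.
SAMPLE_HTML = """
<div class="cart"><span class="amount">$0.00</span></div>
<div class="product">
  <h2>HONEY GLAZED CHICKEN WING LUNCH BOX</h2>
  <span class="amount"> $5.00 </span>
</div>
<div class="product">
  <h2>CRISPY CHICKEN LUNCH BOX</h2>
  <span class="amount"> $4.50 </span>
</div>
"""


def pair_names_and_prices(html):
    """Return (name, price) tuples, one per product container."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for card in soup.find_all("div", class_="product"):
        name = card.find("h2")
        price = card.find("span", class_="amount")
        # find() returns None when nothing matches, so skip incomplete cards
        if name is None or price is None:
            continue
        pairs.append((name.get_text(strip=True), price.get_text(strip=True)))
    return pairs


pairs = pair_names_and_prices(SAMPLE_HTML)
for name, price in pairs:
    print(name, price)
```

A list of tuples is used here instead of a dict so that two products with the same name would not overwrite each other.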


7 Comments

sgeza Over a year ago

thanks this was what i was looking for, I ran this block of code but it didn't put it on the same line though, do u know what i'm missing? The output is exactly as the one in my post, except it didn't include the $0.00

Jay Calamari Over a year ago

Hey, it's doing that cause the price string from the span tag has a bunch of whitespace on both sides. Python actually has a function to get rid of exactly that called strip(). Try changing div.find('span', {'class': 'amount'}).text to div.find('span', {'class': 'amount'}).text.strip() with the .strip() at the end. PS, in the desired output you posted, it says the crispy chicken lunch box is $5.00. It's not on the site, it's $4.50. That's why I was saying be careful with the order :p

Jay Calamari Over a year ago

I guess, printed that way, it won't line up in columns like you showed.

sgeza Over a year ago

Really appreciate your help! Thanks man, totally fixed everything with just that one text.strip()
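For anyone reading along, here is a small illustration of why the whitespace pushed the price onto its own line and what strip() changes. The raw_price value is an invented example of what the span text might look like, not the actual text from the site.

```python
# Invented example of a price string padded with whitespace,
# similar to what .text can return for a nested tag.
raw_price = "\n\t\t$4.50\n\t"
name = "CRISPY CHICKEN LUNCH BOX"

# Without strip(), the leading newline moves the price to the next line.
unstripped = "{} {}".format(name, raw_price)

# strip() removes leading and trailing whitespace (spaces, tabs, newlines).
stripped = "{} {}".format(name, raw_price.strip())

print(repr(unstripped))
print(repr(stripped))
```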

Jay Calamari Over a year ago

Np! (Just PPS, if you wanted to print in nice columns, you can change the print to print("{: <50} {: >5}".format(name, price)). Basically the name and price get put in between the { } brackets, and the 50/5 give a minimum width for the two strings so that it ends up being in columns. Dunno if you need that at all tho.)
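A quick sketch of that format string in action, using two rows copied from the results quoted in this thread. The rows list is just sample data for the illustration.

```python
# Sample (name, price) rows for the illustration.
rows = [
    ("HONEY GLAZED CHICKEN WING LUNCH BOX", "$5.00"),
    ("CRISPY CHICKEN LUNCH BOX", "$4.50"),
]

# "{: <50}" pads the name with spaces on the right to at least 50 characters;
# "{: >5}" right-aligns the price in a field at least 5 characters wide.
lines = ["{: <50} {: >5}".format(name, price) for name, price in rows]

for line in lines:
    print(line)
```

The widths are minimums, so a name longer than 50 characters would still print in full and simply push its price to the right.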


SIM · Accepted Answer · 2018-06-07 11:15:21Z

1

To get the output in the format you described above, you can try the following:

import requests
from bs4 import BeautifulSoup

page = requests.get("https://leeweebrothers.com/our-food/lunch-boxes/#")
soup = BeautifulSoup(page.text, "html.parser")

for items in soup.find_all(class_='product-cat-lunch-boxes'):
    name = items.find("h2").get_text(strip=True)
    price = items.find(class_="amount").get_text(strip=True)
    print(name, price)

Results are like:

HONEY GLAZED CHICKEN WING LUNCH BOX $5.00
CRISPY CHICKEN LUNCH BOX $4.50
BREADED FISH LUNCH BOX $4.50
EGG OMELETTE LUNCH BOX $4.50
FRIED TWO-JOINT WING LUNCH BOX $4.50


sudonym · Accepted Answer · 2018-06-07 04:45:22Z

0

Try this:

for div, a in zip(names, rest):
    # skip empty price strings (empty strings are falsy) and the $0.00 outlier
    if a.text.strip() and '$0.00' not in a.text:
        print(div.text, a.text)  # print name / price on the same line
    else:                        # optional
        print('Outlier')         # optional

Keep in mind this will ONLY work for outliers that contain '$0.00' in a.text.
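If the goal is to "move up" the remaining prices, as the question asks, one possible variation is to drop the outlier from the price list before zipping instead of skipping it inside the loop. This is a rough sketch on plain strings: the sample lists and the helper drop_outliers are made up for the illustration, and it assumes the remaining prices come back in the same order as the names, which may not hold on the real page.

```python
# Sample data standing in for the scraped text of the h2 and span tags.
names = [
    "HONEY GLAZED CHICKEN WING LUNCH BOX",
    "CRISPY CHICKEN LUNCH BOX",
    "BREADED FISH LUNCH BOX",
]
raw_prices = ["$0.00", "\n $5.00 \n", "\n $4.50 \n", "\n $4.50 \n"]


def drop_outliers(prices, outlier="$0.00"):
    """Strip whitespace, then remove empty strings and the outlier value."""
    cleaned = [p.strip() for p in prices]
    return [p for p in cleaned if p and p != outlier]


prices = drop_outliers(raw_prices)

# With the outlier gone, zip pairs each name with the price that follows it.
paired = list(zip(names, prices))
for name, price in paired:
    print(name, price)
```

Note that this would also drop any real item priced at $0.00, so it only fits pages where that value can come from nothing but the cart.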
