I have a web scraping script that worked for months, but today it stopped working. The problem shows up when running:

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

my_url = 'https://www.blocket.se/annonser/hela_sverige/fordon/bilar?cb=40&cbl1=6&cchb=1&ccsc=1&cg=1020&f=c&mye=2017&mys=2013&page=1&sort=date'
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()

page_soup = soup(page_html, "html.parser")
containers = page_soup.findAll("div", {"class":"styled__Wrapper-sc-1kpvi4z-0 itHtzm"})
print(len(containers))

This should be 40 elements long; however, it prints:

0

The command above used to return all the wanted containers, but now it finds nothing. By printing the page_soup variable I found that the class name had changed to gSWafH instead of itHtzm.

containers = page_soup.findAll("div", {"class":"styled__Wrapper-sc-1kpvi4z-0 gSWafH"})
print(len(containers))

This instead gives the expected result:

40

The same kind of change applied to all classes, so I first thought that the website had changed. However, if I inspect the HTML on the website myself, nothing has changed.

Why is there a difference between the HTML I see when viewing the page source in the browser and the HTML I get when reading the page with BS4?

I know that I could change all of the class names/searches to fix the script; however, it's a rather long script and I would much prefer to know the cause of the difference.

2 Answers

The site you're scraping is rendered with JavaScript. bs4 only parses the HTML the server sends back and never runs scripts, so it won't see anything the browser builds afterwards, which explains your result. In other words, the content you see in the browser is not in the HTML you're getting back.

But there's an API that you can query.

Here's how:

import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:83.0) Gecko/20100101 Firefox/83.0",
    "Authorization": "Bearer 2381c6b987aea877abc6a73fe1cbc7d4a88a602c",
}

api = "https://api.blocket.se/search_bff/v1/content?cb=40&cbl1=6&cchb=1&ccsc=1&cg=1020&f=c&lim=40&mye=2017&mys=2013&page=0&sort=date"
response = requests.get(api, headers=headers).json()
print(len(response["data"]))

This should print 40, which is the number of offers.

Note: I'm not sure yet where the Bearer string is coming from, but I'll dig deeper and get back to this answer.

By the way, all the data you want comes back in that JSON so for example you can do this:

import requests
from tabulate import tabulate

headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:83.0) Gecko/20100101 Firefox/83.0",
    "Authorization": "Bearer 2381c6b987aea877abc6a73fe1cbc7d4a88a602c",
}

api = "https://api.blocket.se/search_bff/v1/content?cb=40&cbl1=6&cchb=1&ccsc=1&cg=1020&f=c&lim=40&mye=2017&mys=2013&page=0&sort=date"
response = requests.get(api, headers=headers).json()
print(len(response["data"]))

listing = [
    [
        i["subject"],
        i["price"]["value"],
        i["share_url"]
    ] for i in response["data"]
]

print(tabulate(listing, headers=["Name", "Price", "URL"]))

To get this:

Name                                                            Price  URL
------------------------------------------------------------  -------  --------------------------------------
Volkswagen Golf R 2.0 4Motion DSG Euro 6 300hk                 295000  https://www.blocket.se/vi/94476197.htm
Volkswagen Golf 1.6 TDI Style Eu6 110hk Nyservad               104900  https://www.blocket.se/vi/94476140.htm
Volkswagen Golf 1.4 150hk R-Line Nyservad 2Brukare 4.5/100km   173700  https://www.blocket.se/vi/94393769.htm
Volkswagen Golf GTD 2.0 TDI DSG Euro 6 184hk                   189000  https://www.blocket.se/vi/92041858.htm
Volkswagen Golf 1.4 TSI DSG R-LINE APPLE CARPLAY 150hk         179000  https://www.blocket.se/vi/94258178.htm
Volkswagen Golf R 2.0 4MOTION DSG MILLTEK PANORAMA GPS 300hk   289000  https://www.blocket.se/vi/94067452.htm
Volkswagen Golf 1.0 TSI BLUEMOTION DSG LÅGMIL 110hk            169000  https://www.blocket.se/vi/93849586.htm
Volkswagen Golf R 2.0 4M DSG Panorama Facelift Euro 6 310hk    329900  https://www.blocket.se/vi/94475139.htm
Volkswagen Golf TSi 140hk DSG /R-Line/Drag/Mok                 164900  https://www.blocket.se/vi/94473981.htm
Volkswagen Golf e-Golf 24.2 kWh /V-hjul/Navigation             199900  https://www.blocket.se/vi/91366004.htm
Volkswagen Golf Sportsvan 1.2 TSI DSG AUT Style EU6            159900  https://www.blocket.se/vi/93688078.htm
Volkswagen Golf 1.4 TSI Navi Drag Fullservad 140hk             122800  https://www.blocket.se/vi/94196494.htm
Volkswagen Golf GTE Hybrid 204hk B-kamera Drag Fullservad      199800  https://www.blocket.se/vi/94302825.htm
Volkswagen Golf 1.6 TDI 110HK Fjärrstyrd värmare               129000  https://www.blocket.se/vi/94433465.htm
Volkswagen Golf 1.2 TSI 110HK Aut Årsskatt 382kr               147500  https://www.blocket.se/vi/94057577.htm
Volkswagen Golf GTE 1.4 TSI DSG Sekventiell Euro 6 204hk       174900  https://www.blocket.se/vi/94469939.htm
Volkswagen Golf Alltrack 4M Eu6 184hk Premium D-värm B-kamer   164400  https://www.blocket.se/vi/94240149.htm
Volkswagen Golf 1.2 TSI BlueMotion 105hk Nyservad               84800  https://www.blocket.se/vi/94466632.htm
Volkswagen Golf 1.2 TSI 110 | Style | Komplett servicebok      114500  https://www.blocket.se/vi/92133274.htm
Volkswagen Golf 5-dr P-sensor 1.6 TDI 115hk                    144800  https://www.blocket.se/vi/93528128.htm
Volkswagen Golf 5-door R 2.0 4Motion DSG Skinn Pano Dynaudio   268900  https://www.blocket.se/vi/94134128.htm
Volkswagen Golf 1.6TDI M-värm Drag Eu6 110hk                   129800  https://www.blocket.se/vi/93438018.htm
Volkswagen Golf 1.2 TSI Fullservad 105hk                       104800  https://www.blocket.se/vi/94342484.htm
Volkswagen Golf 1.4 TSI Multifuel | Style | M-Värme | 5-dorr   116800  https://www.blocket.se/vi/93526752.htm
Volkswagen Golf GTI 2.0 TSI 230hk Performance Euro6            239900  https://www.blocket.se/vi/91536231.htm
Volkswagen Golf 140hk / Highline Plus                          114900  https://www.blocket.se/vi/89194496.htm
Volkswagen Golf 1.4 TSI / Style / 5-dörrar                      99000  https://www.blocket.se/vi/93718402.htm
Volkswagen Golf Sportsvan 1.2 TSI DSG 12 månaders garanti      169900  https://www.blocket.se/vi/93729451.htm
Volkswagen Golf Sportsvan TSI 110 MASTERS                      115000  https://www.blocket.se/vi/94455199.htm
Volkswagen Golf 1.2 TSI VÄLVÅRDAD STYLE 105hk                   89900  https://www.blocket.se/vi/94455152.htm
Volkswagen Golf 5-dörrar 1.6 TDI BlueMotion Design sport        72900  https://www.blocket.se/vi/94454862.htm
Volkswagen Golf 1.6 TDI Aut | Darklabel | P-värmare             99000  https://www.blocket.se/vi/94131096.htm
Volkswagen Golf 5-dr GTI Performance 2.0 TSI Eu 6 245hk        234500  https://www.blocket.se/vi/94453627.htm
Volkswagen Golf 5-d GTD 2.0 TDI Premium Kamera/Värmare/Drag    169000  https://www.blocket.se/vi/94453171.htm
Volkswagen Golf GTD 2.0 184hk D-Värmare Dynaudio Välservad     174900  https://www.blocket.se/vi/94452825.htm
Volkswagen Golf 1.6TDI 105hk Style P.sensor Välservad          127900  https://www.blocket.se/vi/94452680.htm
Volkswagen Golf TDI 150hk R-Line DSG / 1.99% Ränta             195000  https://www.blocket.se/vi/93752165.htm
Volkswagen Golf 5-dörrar 1.6 TDI Design, Style 105hk            89900  https://www.blocket.se/vi/94448403.htm
Volkswagen Golf 5-dörrar 1.6 TDI DSG Sekventiell Style 105hk   119900  https://www.blocket.se/vi/94448402.htm
Volkswagen Golf 5-door R 2.0 4Motion DSG 310hk                 309000  https://www.blocket.se/vi/94126802.htm
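
To go through more than one page, one option is to change the page value in the API string and request one page at a time. Below is a minimal sketch of that idea. It assumes that page is zero-based (as in page=0 above) and that an empty "data" list marks the end of the results; neither is confirmed by any documentation I know of. The helper names page_url and fetch_all are made up for this example.

```python
from urllib.parse import urlencode

API_BASE = "https://api.blocket.se/search_bff/v1/content"

# Query parameters copied from the API URL used in this answer.
BASE_PARAMS = {
    "cb": 40, "cbl1": 6, "cchb": 1, "ccsc": 1, "cg": 1020,
    "f": "c", "lim": 40, "mye": 2017, "mys": 2013, "sort": "date",
}


def page_url(page):
    """Build the search URL for one page (assumed to be zero-based)."""
    params = dict(BASE_PARAMS, page=page)
    return f"{API_BASE}?{urlencode(params)}"


def fetch_all(get_page, max_pages=50):
    """Collect offers page by page until a page comes back empty.

    get_page is any callable that takes a URL and returns a list of offers.
    Stopping on an empty list is an assumption about how the API behaves.
    """
    offers = []
    for page in range(max_pages):
        items = get_page(page_url(page))
        if not items:
            break
        offers.extend(items)
    return offers
```

With requests, get_page could be something like lambda url: requests.get(url, headers=headers).json()["data"], reusing the headers dict from above. The max_pages cap is only there so the loop cannot run forever if the stop condition turns out to be wrong.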

11 Comments

Thanks, but this answer is far more advanced than my knowledge of the subject. What does JS mean in this context? It worked just fine the other day as it is. From that page I specifically need the address of the actual offering, so the "to" address. In hindsight I understand that the JSON approach would be better, but this is what I came up with from a simple tutorial and it has worked great for over 8 months up until now...
JS stands for JavaScript. Your approach used to work because the site might have been more "static", meaning less content was rendered dynamically by JavaScript. It seems like that site has changed, rendering your script useless. And even if it hadn't, there's still a possibility that the class names are going to be random, which makes your script hard to maintain, as you need to periodically check the source code of the page. An API approach has none of those problems.
I've edited the answer so you can see the "go to" url now too.
Thank you so much. From what I can tell, the response even has information that I previously had to get by going onto every offering one by one. Is this correct or am I missing something? It seems as though you are only going to the "view all offerings" page, and from there you can read the description of the car even though you have not actually visited the offering for the specific car.
One more quick question: is there a nice way to go through all pages, or is the best way to simply change the api-string and request the information for one page at a time?

EDIT: Actually the guy above me is right. I didn't look into whether it was loaded in via JS, since I thought you were able to scrape via normal requests just fine. This solution doesn't work if it's dynamic.

I just checked myself and it seems that the site does change the class names to random values. Do you only need to get the data of each ad? Then you could try to match on just the first class. By the looks of your code, that one doesn't change:

containers = page_soup.findAll("div", {"class":"styled__Wrapper-sc-1kpvi4z-0"})

This gets the same amount of results as with the random class added. By checking in the dev console there is a slider at the top which explains why.

I do wonder how you managed to get 40 results with those 2 classes though, since I get 47.
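
The difference between searching for one class and searching for the full class string can be reproduced on a small inline document. This is only a sketch: the HTML is invented for illustration, and kXyZab is a made-up suffix standing in for whatever class the slider elements carry. Passing a single class name matches every div that has that class among its classes, while passing a string with a space in it only matches elements whose class attribute is exactly that string.

```python
from bs4 import BeautifulSoup

# Invented sample: two ads plus one other element sharing the stable class.
html = """
<div class="styled__Wrapper-sc-1kpvi4z-0 gSWafH">ad 1</div>
<div class="styled__Wrapper-sc-1kpvi4z-0 gSWafH">ad 2</div>
<div class="styled__Wrapper-sc-1kpvi4z-0 kXyZab">slider item</div>
"""
page_soup = BeautifulSoup(html, "html.parser")

# One class name: matches any div that has this class, whatever else it has.
stable_only = page_soup.find_all("div", {"class": "styled__Wrapper-sc-1kpvi4z-0"})

# Two classes in one string: matches only the exact class attribute value.
exact = page_soup.find_all("div", {"class": "styled__Wrapper-sc-1kpvi4z-0 gSWafH"})

print(len(stable_only), len(exact))
```

So matching on the stable class alone survives a change of the random suffix, but it can also pick up extra elements that share that class, which would need to be filtered out afterwards.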

1 Comment

But how does it change randomly? I mean, if I go to the webpage and analyze the HTML code through "view elements" (rough translation), I get the same class name all the time, but the BS command returns similar but not exactly the same code as I can view myself...
