
The problem

I'm new to web scraping, and I was trying to create a scraper that takes a playlist link and extracts the list of songs and their artists.

But the site kept rejecting my connection because it thought I was a bot, so I used fake_useragent's UserAgent to generate a random user-agent string to try to bypass the filter.

It sort of worked? The problem is that when you visit the website in a browser, you can see the contents of the playlist, but when you extract the HTML with requests, the playlist is just a big blank space.

Maybe I have to wait for the page to load? Or is there a stronger bot filter?

My code

import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent

ua = UserAgent()

melon_site = "http://kko.to/IU8zwNmjM"

# Send a random user-agent header to try to get past the bot filter
headers = {'User-Agent': ua.random}
result = requests.get(melon_site, headers=headers)

print(result.status_code)
src = result.content
soup = BeautifulSoup(src, 'html.parser')
print(soup)

Link of website

playlist link

HTML I get when using requests

[screenshot: HTML with a blank space where the playlist was supposed to be]

  • Does this answer your question? How to Bypass Google Recaptcha while scraping with Requests?
  • Try using the same method.
  • Nope, doesn't work. It shows a 404 error when trying to use the Google cache.
  • BTW: if the page uses JavaScript to add elements, then you can't get them directly using requests/BeautifulSoup, because they don't run JavaScript. You may need Selenium to control a real web browser that can run JavaScript, or you can try to find the URL the JavaScript uses to fetch the data from the server and use that URL with requests.

3 Answers


You want to check out this link, the URL the page itself calls to load the playlist, to get the content you wish to grab.

The following attempt should fetch you the artist names and their song names.

import requests
from bs4 import BeautifulSoup

# The endpoint the playlist page fetches its song list from
url = 'https://www.melon.com/mymusic/playlist/mymusicplaylistview_listSong.htm?plylstSeq=473505374'

r = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
soup = BeautifulSoup(r.text, "html.parser")

# Each song row contains an #artistName cell; :has() needs bs4 >= 4.7.0
for item in soup.select("tr:has(#artistName)"):
    artist_name = item.select_one("#artistName > a[href*='goArtistDetail']")['title']
    song = item.select_one("a[href*='playSong']")['title']
    print(artist_name, song)

The output looks like this (the Korean fragments are link-title boilerplate from the page markup: 페이지 이동 means "go to page" and 재생 - 새 창 means "play - new window"):

Martin Garrix - 페이지 이동 Used To Love (feat. Dean Lewis) 재생 - 새 창
Post Malone - 페이지 이동 Circles 재생 - 새 창
Marshmello - 페이지 이동 Here With Me 재생 - 새 창
Coldplay - 페이지 이동 Cry Cry Cry 재생 - 새 창

Note: your BeautifulSoup version should be 4.7.0 or later in order for the script to support the :has() pseudo-selector.
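A quick way to check which version you have installed:

import bs4
print(bs4.__version__)  # needs to be 4.7.0 or later for :has() support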


2 Comments

Select the XHR tab within the Network section of Chrome dev tools, then reload the page and you should find the link yourself.
Is there a way to bypass whitepages.com? The above trick did not work.

Points to remember while scraping

1) Use a good user agent. ua.random may be returning a user agent that is blocked by the server.

2) If you are doing a lot of scraping, slow your pace: use time.sleep() so your IP address doesn't overload the server, or it will block you (see the sketch after this list).

3) If the server blocks you, try rotating IPs.
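A minimal sketch of points 2 and 3 together, pacing requests with time.sleep() and rotating through a proxy pool; the URLs and proxy addresses below are placeholders to substitute with your own:

import time
import random
import requests

# Placeholder pages and proxies; substitute your own
urls = ["https://example.com/page1", "https://example.com/page2"]
proxy_pool = [
    {"http": "http://10.0.0.1:8080", "https": "http://10.0.0.1:8080"},
    {"http": "http://10.0.0.2:8080", "https": "http://10.0.0.2:8080"},
]

for url in urls:
    proxy = random.choice(proxy_pool)  # point 3: rotate IPs across requests
    r = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, proxies=proxy)
    print(url, r.status_code)
    time.sleep(random.uniform(1, 3))   # point 2: polite, jittered pacing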

3 Comments

1. Already tried every user agent I can find on the web, including my own; not working. 2. Not doing too much scraping; there isn't even a loop. 3. The server didn't block my IP address, because the website opens fine in a normal browser.
Any other ideas?
It's probably detecting that you don't run the page's JavaScript.

That's because the playlist is loaded via JavaScript API calls AFTER the page has been loaded into an actual web browser and the document-ready event has fired. At a quick glance it's probably "/mymusic/common/mymusiccommon_copyPlaylist.json".

requests and BeautifulSoup do static page manipulation: requests downloads the HTML, and BeautifulSoup loads it into a DOM to make extracting data easier. THEY WILL NOT RUN DYNAMIC WEB PAGES. You need a headless web browser for that, such as Selenium, Puppeteer (pptr), or Playwright, which will run the JavaScript and make the myriad of EXTRA calls a website does to fetch the rest of its actual content. I don't think they are actually using ANY bot detection.
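A minimal sketch of the Selenium route, assuming Chrome and a matching chromedriver are installed; the CSS selector is a placeholder you would adjust after inspecting the real page:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = Options()
options.add_argument("--headless")  # run without a visible browser window
driver = webdriver.Chrome(options=options)
try:
    driver.get("http://kko.to/IU8zwNmjM")
    # Wait until the page's JavaScript has rendered the playlist rows.
    # "table tbody tr" is a guessed selector; inspect the page to refine it.
    rows = WebDriverWait(driver, 10).until(
        EC.presence_of_all_elements_located((By.CSS_SELECTOR, "table tbody tr"))
    )
    for row in rows:
        print(row.text)
finally:
    driver.quit()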
