
The problem

I'm new to web scraping, and I was trying to create a scraper that takes a playlist link and extracts the list of songs and their artists.

But the site kept rejecting my connection because it thought I was a bot, so I used fake_useragent's UserAgent to generate a random user-agent string to try to bypass the filter.

It sort of worked? The problem is that when you visit the website in a browser, you can see the contents of the playlist, but when you extract the HTML with requests, the playlist is just a big blank space.

Maybe I have to wait for the page to load? Or is there a stronger bot filter?

My code

import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent

ua = UserAgent()

melon_site = "http://kko.to/IU8zwNmjM"

# Send a random user-agent header to try to get past the bot filter
headers = {'User-Agent': ua.random}
result = requests.get(melon_site, headers=headers)

print(result.status_code)
src = result.content
soup = BeautifulSoup(src, 'html.parser')
print(soup)

Link of website

playlist link

HTML I get when using requests

[screenshot: HTML with a blank space where the playlist was supposed to be]

  • Does this answer your question? How to Bypass Google Recaptcha while scraping with Requests?
  • Try using the same method.
  • Nope, doesn't work. It shows a 404 error when trying to use the Google cache.
  • BTW: if the page uses JavaScript to add elements, then you can't get them directly using requests/BeautifulSoup, because they don't run JavaScript. You may need Selenium to control a real web browser that can run JavaScript, or you can try to find the URL the JavaScript uses to fetch the data from the server and use that URL with requests.

3 Answers


You want to check out this link, the URL the page itself calls to load the playlist, to get the content you wish to grab.

The following attempt should fetch you the artist names and their song names.

import requests
from bs4 import BeautifulSoup

# The endpoint the playlist page fetches its song list from
url = 'https://www.melon.com/mymusic/playlist/mymusicplaylistview_listSong.htm?plylstSeq=473505374'

r = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
soup = BeautifulSoup(r.text, "html.parser")

# Each song row contains an #artistName cell; :has() needs bs4 >= 4.7.0
for item in soup.select("tr:has(#artistName)"):
    artist_name = item.select_one("#artistName > a[href*='goArtistDetail']")['title']
    song = item.select_one("a[href*='playSong']")['title']
    print(artist_name, song)

The output looks like this (the Korean fragments are link-title boilerplate from the page markup: 페이지 이동 means "go to page" and 재생 - 새 창 means "play - new window"):

Martin Garrix - 페이지 이동 Used To Love (feat. Dean Lewis) 재생 - 새 창
Post Malone - 페이지 이동 Circles 재생 - 새 창
Marshmello - 페이지 이동 Here With Me 재생 - 새 창
Coldplay - 페이지 이동 Cry Cry Cry 재생 - 새 창

Note: your BeautifulSoup version should be 4.7.0 or later in order for the script to support the :has() pseudo-selector.
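A quick way to check which version you have installed:

import bs4
print(bs4.__version__)  # needs to be 4.7.0 or later for :has() support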


2 Comments

Select the XHR tab within the Network section of Chrome dev tools, then reload the page and you should find the link yourself.
Is there a way to bypass whitepages.com? The above trick did not work.

Points to remember while scraping

1) Use a good user agent. ua.random may be returning a user agent that is blocked by the server.

2) If you are doing a lot of scraping, slow your pace: use time.sleep() so your IP address doesn't overload the server, or it will block you (see the sketch after this list).

3) If the server blocks you, try rotating IPs.
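A minimal sketch of points 2 and 3 together, pacing requests with time.sleep() and rotating through a proxy pool; the URLs and proxy addresses below are placeholders to substitute with your own:

import time
import random
import requests

# Placeholder pages and proxies; substitute your own
urls = ["https://example.com/page1", "https://example.com/page2"]
proxy_pool = [
    {"http": "http://10.0.0.1:8080", "https": "http://10.0.0.1:8080"},
    {"http": "http://10.0.0.2:8080", "https": "http://10.0.0.2:8080"},
]

for url in urls:
    proxy = random.choice(proxy_pool)  # point 3: rotate IPs across requests
    r = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, proxies=proxy)
    print(url, r.status_code)
    time.sleep(random.uniform(1, 3))   # point 2: polite, jittered pacing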

3 Comments

1. Already tried every user agent I can find on the web, including my own; not working. 2. Not doing too much scraping; there isn't even a loop. 3. The server didn't block my IP address, because the website opens fine in a normal browser.
Any other ideas?
It's probably detecting that you don't run the page's JavaScript.

That's because the playlist is loaded via JavaScript API calls AFTER the page has been loaded into an actual web browser and the document-ready event has fired. At a quick glance it's probably "/mymusic/common/mymusiccommon_copyPlaylist.json".

requests and BeautifulSoup do static page manipulation: requests downloads the HTML, and BeautifulSoup loads it into a DOM to make extracting data easier. THEY WILL NOT RUN DYNAMIC WEB PAGES. You need a headless web browser for that, such as Selenium, Puppeteer (pptr), or Playwright, which will run the JavaScript and make the myriad of EXTRA calls a website does to fetch the rest of its actual content. I don't think they are actually using ANY bot detection.
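A minimal sketch of the Selenium route, assuming Chrome and a matching chromedriver are installed; the CSS selector is a placeholder you would adjust after inspecting the real page:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = Options()
options.add_argument("--headless")  # run without a visible browser window
driver = webdriver.Chrome(options=options)
try:
    driver.get("http://kko.to/IU8zwNmjM")
    # Wait until the page's JavaScript has rendered the playlist rows.
    # "table tbody tr" is a guessed selector; inspect the page to refine it.
    rows = WebDriverWait(driver, 10).until(
        EC.presence_of_all_elements_located((By.CSS_SELECTOR, "table tbody tr"))
    )
    for row in rows:
        print(row.text)
finally:
    driver.quit()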
