0

I want to crawl the YouTube homepage and pull out the links to all the videos. Here is my code:

from bs4 import BeautifulSoup
import requests

s='https://www.youtube.com/'
html=requests.get(s)
html=html.text

s=BeautifulSoup(html,features="html.parser")

for e in s.find_all('a',{'id':'video-title'}):
    link=e.get('href')
    text=e.string
    print(text)
    print(link)
    print()

Nothing happens when I run the above code. It seems the id is not being found. What am I doing wrong?

4 Answers

1

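This is because requests does not receive the same HTML that your browser displays. YouTube builds most of the page with JavaScript after the initial load, and requests does not run JavaScript.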

import requests
from bs4 import BeautifulSoup


s =  requests.get("https://youtube.com").text

soup = BeautifulSoup(s,'lxml')

print(soup)

Save the output of this code to a file named test.html and open it in your browser. You will see that it is not the same page your browser normally shows; it looks broken.

See these related questions:

HTML in browser doesn't correspond to scraped data in python

Python requests not giving me the same HTML as my browser is

Basically, I recommend using Selenium WebDriver, since it drives a real browser and therefore runs the page's JavaScript.
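
A minimal sketch of the Selenium approach, assuming Selenium 4 and a recent Chrome are installed. The function names fetch_rendered_html and extract_video_links are made up for illustration, and the a#video-title selector is taken from the question, so it may not match YouTube's current markup:

```python
from bs4 import BeautifulSoup


def fetch_rendered_html(url, wait_selector="a#video-title", timeout=15):
    """Load url in headless Chrome and return the HTML after JavaScript has run."""
    # Imported here so the parsing helper below works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")  # assumption: a recent Chrome build
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Wait until at least one matching element exists;
        # raises TimeoutException if none appears in time.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, wait_selector))
        )
        return driver.page_source
    finally:
        driver.quit()


def extract_video_links(html, selector="a#video-title"):
    """Return (title, href) pairs for every anchor matching selector."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        (a.get_text(strip=True), a.get("href"))
        for a in soup.select(selector)
        if a.get("href")
    ]


# Usage (needs Chrome and network access, so it is left commented out):
# html = fetch_rendered_html("https://www.youtube.com/")
# for title, link in extract_video_links(html):
#     print(title, link)
```

Keeping the parsing step separate from the browser step means the BeautifulSoup part of the question's code can stay almost unchanged; only the source of the HTML differs.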

1

Yes, this is an odd page to scrape, but if you select at the <div id="content"> level, you can get the data you are after. I was able to get the title of each video, but YouTube appears to rate-limit or throttle requests, so I do not think you will be able to get ALL of the titles and links. At any rate, here is what I got working for the titles:

import requests
from bs4 import BeautifulSoup

url = 'https://www.youtube.com/'
response = requests.get(url)
page = response.text
soup = BeautifulSoup(page, 'html.parser')
links = soup.find_all('div', id='content')

for each in links:
    print(each.text)

1 Comment

Don't we have to give the id in the form of a dictionary, like this: {'id':'content'}?
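
On that point, BeautifulSoup accepts both forms. A small sketch on a made-up HTML snippet (not YouTube's markup) comparing the keyword form with the dictionary form:

```python
from bs4 import BeautifulSoup

# Made-up markup, only for comparing the two call styles.
html = '<div id="content">first</div><div id="sidebar">second</div>'
soup = BeautifulSoup(html, "html.parser")

by_keyword = soup.find_all("div", id="content")           # keyword argument
by_dict = soup.find_all("div", {"id": "content"})         # dict as second argument
by_attrs = soup.find_all("div", attrs={"id": "content"})  # explicit attrs=

same_result = by_keyword == by_dict == by_attrs
print(same_result)
print([div.get_text() for div in by_keyword])
```

The dictionary form is only required for attribute names that cannot be written as Python keyword arguments, such as data-* attributes or an attribute called name.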
1

Maybe this will help with scraping all the videos from the YouTube homepage:

    from bs4 import BeautifulSoup
    import requests

    r = 'https://www.youtube.com/'
    html = requests.get(r)

    all_videos = []

    soup = BeautifulSoup(html.text, 'html.parser')
    for i in soup.find_all('a'):
        if i.has_attr('href'):
            text = i.attrs.get('href')
            if text.startswith('/watch?'):
                # r ends with '/' and text starts with '/', so drop one to avoid '//'
                urls = r.rstrip('/') + text
                all_videos.append(urls)
    print('Total Videos', len(all_videos))
    print('LIST OF VIDEOS', all_videos)

1

This code snippet selects all links on the youtube.com homepage that contain /watch? in their href attribute (links to videos):

from bs4 import BeautifulSoup
import requests

soup = BeautifulSoup(requests.get('https://www.youtube.com/').text, 'lxml')

for a in soup.select('a[href*="/watch?"]'):
    print('https://www.youtube.com{}'.format(a['href']))

Prints:

https://www.youtube.com/watch?v=pBhkG2Zwf-c
https://www.youtube.com/watch?v=pBhkG2Zwf-c
https://www.youtube.com/watch?v=gnn7GwqXek4
https://www.youtube.com/watch?v=gnn7GwqXek4
https://www.youtube.com/watch?v=AMKDVfucPfA
https://www.youtube.com/watch?v=AMKDVfucPfA
https://www.youtube.com/watch?v=daQcqPHx9uw
https://www.youtube.com/watch?v=daQcqPHx9uw
https://www.youtube.com/watch?v=V_MXGdSBbAI
https://www.youtube.com/watch?v=V_MXGdSBbAI
https://www.youtube.com/watch?v=KEW9U7s_zks
https://www.youtube.com/watch?v=KEW9U7s_zks
https://www.youtube.com/watch?v=EM7ZR5z3kCo
https://www.youtube.com/watch?v=EM7ZR5z3kCo
https://www.youtube.com/watch?v=6NPHk-Yd4VU
https://www.youtube.com/watch?v=6NPHk-Yd4VU
https://www.youtube.com/watch?v=dHiAls8loz4
https://www.youtube.com/watch?v=dHiAls8loz4
https://www.youtube.com/watch?v=2_mDOWLhkVU
https://www.youtube.com/watch?v=2_mDOWLhkVU

...and so on
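
Each link in the output above appears twice. A small sketch for dropping the repeats while keeping the original order; the helper name unique_video_urls is made up for illustration and the href list is sample data taken from the output above:

```python
from urllib.parse import urljoin

BASE_URL = "https://www.youtube.com/"


def unique_video_urls(hrefs, base_url=BASE_URL):
    """Turn relative /watch? hrefs into absolute URLs, dropping repeats in order."""
    absolute = (urljoin(base_url, href) for href in hrefs if "/watch?" in href)
    # dict.fromkeys keeps the first occurrence of each key and preserves order.
    return list(dict.fromkeys(absolute))


hrefs = [
    "/watch?v=pBhkG2Zwf-c",
    "/watch?v=pBhkG2Zwf-c",
    "/watch?v=gnn7GwqXek4",
    "/about",
]
print(unique_video_urls(hrefs))
```

With the snippet above, the input could be built as [a['href'] for a in soup.select('a[href*="/watch?"]')].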
