I am new to web scraping and I am trying to scrape all the video links from each page of this specific site and write them to a CSV file. For starters, I am trying to scrape the URLs from this site:

https://search.bilibili.com/all?keyword=%E3%82%A2%E3%83%8B%E3%82%B2%E3%83%A9%EF%BC%81%E3%83%87%E3%82%A3%E3%83%89%E3%82%A5%E3%83%BC%E3%83%BC%E3%83%B3

and going through all 19 pages. The problem I'm encountering is that the same 20 video links are written 19 times (once per page) instead of roughly 19 distinct sets of URLs.

import requests 
from bs4 import BeautifulSoup
from csv import writer 

def make_soup(url): 
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    return soup

def scrape_url():
    for video in soup.find_all('a', class_='img-anchor'):
        link = video['href'].replace('//','')
        csv_writer.writerow([link])

with open("videoLinks.csv", 'w') as csv_file:
        csv_writer = writer(csv_file)
        header = ['URLS']
        csv_writer.writerow(header)

        url = 'https://search.bilibili.com/all?keyword=%E3%82%A2%E3%83%8B%E3%82%B2%E3%83%A9%EF%BC%81%E3%83%87%E3%82%A3%E3%83%89%E3%82%A5%E3%83%BC%E3%83%BC%E3%83%B3'
        soup = make_soup(url)

        lastButton = soup.find_all(class_='page-item last')
        lastPage = lastButton[0].text
        lastPage = int(lastPage)
        #print(lastPage)

        page = 1
        pageExtension = ''

        scrape_url()

        while page < lastPage:
            page = page + 1
            if page == 1:
                pageExtension = ''
            else:
                pageExtension = '&page='+str(page)
            #print(url+pageExtension)
            fullUrl = url+pageExtension
            make_soup(fullUrl)
            scrape_url()

Any help is much appreciated and I decided to code this specific way so that I can better generalize this throughout the BiliBili site.

A screenshot is linked below showing how the first link repeats a total of 19 times:

Screenshot of csv file

2 Answers

Try

soup = make_soup(fullUrl)

in the second-to-last line.


In the second-to-last line, you are not assigning the return value of make_soup. Your scrape_url function reads a variable called soup, but that variable only gets assigned once, for the first page, so every call scrapes the same page.

If you change that line to soup = make_soup(fullUrl), it should work.
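Either fix works because the root cause is the same: scrape_url reads a global soup that is only ever assigned once. An alternative that avoids the global entirely is to pass the parsed page in as a parameter. Here is a minimal sketch (the selector and names are taken from the question's code; it is checked against a tiny inline HTML snippet rather than the live site):

```python
from csv import writer
from io import StringIO

import requests
from bs4 import BeautifulSoup

def make_soup(url):
    response = requests.get(url)
    return BeautifulSoup(response.text, 'html.parser')

def scrape_url(soup, csv_writer):
    # The parsed page is passed in explicitly, so each call works on
    # the page it was given rather than on a stale global.
    for video in soup.find_all('a', class_='img-anchor'):
        link = video['href'].replace('//', '', 1)
        csv_writer.writerow([link])

# Offline check against a small snippet (no network needed):
sample = BeautifulSoup('<a class="img-anchor" href="//example.com/v/1"></a>',
                       'html.parser')
buffer = StringIO()
scrape_url(sample, writer(buffer))
print(buffer.getvalue().strip())  # example.com/v/1
```

With this shape, the paging loop becomes `scrape_url(make_soup(fullUrl), csv_writer)`, and there is no assignment to forget.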


Ah, thank you! Made a rookie mistake there! Thanks for the explanation!
