I am new to web scraping and I am trying to scrape all the video links from each page of this specific site and write them to a CSV file. For starters, I am trying to scrape the URLs from this site:

https://search.bilibili.com/all?keyword=%E3%82%A2%E3%83%8B%E3%82%B2%E3%83%A9%EF%BC%81%E3%83%87%E3%82%A3%E3%83%89%E3%82%A5%E3%83%BC%E3%83%BC%E3%83%B3

and going through all 19 pages. The problem I'm encountering is that the same 20 video links are written 19 times (once per page) instead of roughly 19 distinct sets of URLs.

import requests 
from bs4 import BeautifulSoup
from csv import writer 

def make_soup(url): 
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    return soup

def scrape_url():
    for video in soup.find_all('a', class_='img-anchor'):
        link = video['href'].replace('//','')
        csv_writer.writerow([link])

with open("videoLinks.csv", 'w') as csv_file:
        csv_writer = writer(csv_file)
        header = ['URLS']
        csv_writer.writerow(header)

        url = 'https://search.bilibili.com/all?keyword=%E3%82%A2%E3%83%8B%E3%82%B2%E3%83%A9%EF%BC%81%E3%83%87%E3%82%A3%E3%83%89%E3%82%A5%E3%83%BC%E3%83%BC%E3%83%B3'
        soup = make_soup(url)

        lastButton = soup.find_all(class_='page-item last')
        lastPage = lastButton[0].text
        lastPage = int(lastPage)
        #print(lastPage)

        page = 1
        pageExtension = ''

        scrape_url()

        while page < lastPage:
            page = page + 1
            if page == 1:
                pageExtension = ''
            else:
                pageExtension = '&page='+str(page)
            #print(url+pageExtension)
            fullUrl = url+pageExtension
            make_soup(fullUrl)
            scrape_url()

Any help is much appreciated and I decided to code this specific way so that I can better generalize this throughout the BiliBili site.

A screenshot is linked below showing how the first link repeats a total of 19 times:

Screenshot of csv file

2 Answers

Try

soup = make_soup(fullUrl)

in the second-to-last line.


In the second-to-last line, you are not assigning the return value of make_soup. Your scrape_url function reads a variable called soup, but that variable only gets assigned once, for the first page, so every call scrapes the same page.

If you change that line to soup = make_soup(fullUrl), it should work.
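Either fix works because the root cause is the same: scrape_url reads a global soup that is only ever assigned once. An alternative that avoids the global entirely is to pass the parsed page in as a parameter. Here is a minimal sketch (the selector and names are taken from the question's code; it is checked against a tiny inline HTML snippet rather than the live site):

```python
from csv import writer
from io import StringIO

import requests
from bs4 import BeautifulSoup

def make_soup(url):
    response = requests.get(url)
    return BeautifulSoup(response.text, 'html.parser')

def scrape_url(soup, csv_writer):
    # The parsed page is passed in explicitly, so each call works on
    # the page it was given rather than on a stale global.
    for video in soup.find_all('a', class_='img-anchor'):
        link = video['href'].replace('//', '', 1)
        csv_writer.writerow([link])

# Offline check against a small snippet (no network needed):
sample = BeautifulSoup('<a class="img-anchor" href="//example.com/v/1"></a>',
                       'html.parser')
buffer = StringIO()
scrape_url(sample, writer(buffer))
print(buffer.getvalue().strip())  # example.com/v/1
```

With this shape, the paging loop becomes `scrape_url(make_soup(fullUrl), csv_writer)`, and there is no assignment to forget.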


Ah, thank you! Made a rookie mistake there! Thanks for the explanation!
