
I am trying to scrape a table that spans multiple pages and export it to a CSV file. Only one line of data seems to get exported, and it is jumbled up.

I have looked on the web and tried many iterations, and I am very frustrated now. As you can tell from the code, I am a novice at coding!

import bs4 as bs
import urllib.request
import pandas as pd
import csv


max_page_num = 14
max_page_dig = 1 # number of digits in the page number


with open('result.csv',"w") as f:
    f.write("Name, Gender, State, Position, Grad, Club/HS, Rating, Commitment \n")

for i in range(0, max_page_num):  
    page_num = (max_page_dig - len(str(i))) * "0" +str(i) #gives a string in the format of 1, 01 or 001, 005 etc
    print(page_num)
    source = "https://www.topdrawersoccer.com/search/?query=&divisionId=&genderId=m&graduationYear=2020&positionId=0&playerRating=&stateId=All&pageNo=" + page_num + "&area=commitments"
    print(source)

    url = urllib.request.urlopen(source).read()    

    soup = bs.BeautifulSoup(url,'lxml')
    table = soup.find('table')
    table_rows = table.find_all('tr')

    for tr in table_rows:
        td = tr.find_all('td')
        row = [i.text for i in td]
        #final = row.strip("\n")
        #final = row.replace("\n","")
        with open('result.csv', 'a') as f:
            f.write(row)

It seems that when I write to the CSV it overwrites the previous rows. It also pastes everything on one line, and the player's name is concatenated with the school name. Thanks for any and all help.

1 Answer


I think the problem is with your inner for loop: it reopens the file for every row. Try rewriting it as

with open('result.csv', 'a') as f:
    for tr in table_rows:
        td = tr.find_all('td')
        row = [i.text for i in td]
        f.write(",".join(row))  # f.write expects a string, not a list

and see if it works.
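Since each joined row still needs its own line ending (and cells can contain stray newlines from the page markup), the built-in csv module is the safer way to write the file, as it handles quoting and row terminators for you. A minimal sketch, with made-up rows standing in for the scraped table cells:

```python
import csv

# Hypothetical rows standing in for the scraped table data
rows = [
    ["Jane Doe", "F", "NY", "Forward", "2020", "Club FC", "4", "State U"],
    ["John Roe", "M", "CA", "Keeper", "2020", "HS Team", "3", ""],
]

# newline="" lets the csv module control line endings itself
with open("result.csv", "a", newline="") as f:
    writer = csv.writer(f)
    for row in rows:
        # strip stray whitespace/newlines from each cell before writing
        writer.writerow(cell.strip() for cell in row)
```

In your scraper, `rows` would be the lists you build from `tr.find_all('td')`.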

More generally, this can probably be done more simply with pandas. Try changing your for loop to:

for i in range(0, max_page_num):
    page_num = ...
    source = ...
    df = pd.read_html(source)
    df.to_csv('results.csv', header=False, index=False, mode='a')  # 'a' appends each table to the csv file instead of overwriting it

2 Comments

Hi Jack (fellow NY'er). Thanks so much for the effort. Using your first correction it does indeed download them all, but unfortunately it puts them all on one line with \n's in it. With the second method I got the error AttributeError: 'list' object has no attribute 'to_csv'.
@HermanL - Small world! Anyway, df = pd.read_html(source) may return more than one table, depending on the page - that's the 'list' the error is referring to. To see how many tables you snagged, add print(len(df)). Then write to CSV with df[0].to_csv(...).
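To make the list behaviour concrete, here is a minimal sketch that feeds read_html an inline HTML string with made-up data instead of the live URL (the real call would pass `source`; newer pandas versions prefer a file-like object, hence the StringIO wrapper):

```python
from io import StringIO

import pandas as pd

# Stand-in for the fetched page; the real call would be pd.read_html(source)
html = """
<table>
  <tr><th>Name</th><th>State</th></tr>
  <tr><td>Jane Doe</td><td>NY</td></tr>
</table>
"""

tables = pd.read_html(StringIO(html))  # returns a *list* of DataFrames
print(len(tables))                     # how many tables were found on the page

# Write the first (here, only) table, appending rather than overwriting
tables[0].to_csv("results.csv", header=False, index=False, mode="a")
```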
