
I am trying to scrape a table that spans multiple pages and export it to a CSV file. Only one line of data seems to get exported, and it is jumbled up.

I have looked on the web and tried many iterations, and I am very frustrated now. As you can tell from the code, I am a novice at coding!

import bs4 as bs
import urllib.request
import pandas as pd
import csv


max_page_num = 14
max_page_dig = 1 # number of digits in the page number


with open('result.csv',"w") as f:
    f.write("Name, Gender, State, Position, Grad, Club/HS, Rating, Commitment \n")

for i in range(0, max_page_num):  
    page_num = (max_page_dig - len(str(i))) * "0" +str(i) #gives a string in the format of 1, 01 or 001, 005 etc
    print(page_num)
    source = "https://www.topdrawersoccer.com/search/?query=&divisionId=&genderId=m&graduationYear=2020&positionId=0&playerRating=&stateId=All&pageNo=" + page_num + "&area=commitments"
    print(source)

    url = urllib.request.urlopen(source).read()    

    soup = bs.BeautifulSoup(url,'lxml')
    table = soup.find('table')
    table_rows = table.find_all('tr')

    for tr in table_rows:
        td = tr.find_all('td')
        row = [i.text for i in td]
        #final = row.strip("\n")
        #final = row.replace("\n","")
        with open('result.csv', 'a') as f:
            f.write(row)

It seems that when I write to the CSV it overwrites the previous rows. It also pastes everything on one line, and the player's name is concatenated with the school name. Thanks for any and all help.

1 Answer


I think the problem is with your inner for loop: it reopens the file for every row. Try rewriting it as

with open('result.csv', 'a') as f:
    for tr in table_rows:
        td = tr.find_all('td')
        row = [i.text for i in td]
        f.write(",".join(row))  # f.write expects a string, not a list

and see if it works.
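Since each joined row still needs its own line ending (and cells can contain stray newlines from the page markup), the built-in csv module is the safer way to write the file, as it handles quoting and row terminators for you. A minimal sketch, with made-up rows standing in for the scraped table cells:

```python
import csv

# Hypothetical rows standing in for the scraped table data
rows = [
    ["Jane Doe", "F", "NY", "Forward", "2020", "Club FC", "4", "State U"],
    ["John Roe", "M", "CA", "Keeper", "2020", "HS Team", "3", ""],
]

# newline="" lets the csv module control line endings itself
with open("result.csv", "a", newline="") as f:
    writer = csv.writer(f)
    for row in rows:
        # strip stray whitespace/newlines from each cell before writing
        writer.writerow(cell.strip() for cell in row)
```

In your scraper, `rows` would be the lists you build from `tr.find_all('td')`.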

More generally, this can probably be done more simply with pandas. Try changing your for loop to:

for i in range(0, max_page_num):
    page_num = ...
    source = ...
    df = pd.read_html(source)
    df.to_csv('results.csv', header=False, index=False, mode='a')  # 'a' appends each table to the csv file instead of overwriting it

2 Comments

Hi Jack (fellow NY'er). Thanks so much for the effort. Using your first correction it does indeed download them all, but unfortunately it puts them all on one line with \n's in it. With the second method I got the error AttributeError: 'list' object has no attribute 'to_csv'.
@HermanL - Small world! Anyway, df = pd.read_html(source) may return more than one table, depending on the page - that's the 'list' the error is referring to. To see how many tables you snagged, add print(len(df)). Then write to CSV with df[0].to_csv(...).
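To make the list behaviour concrete, here is a minimal sketch that feeds read_html an inline HTML string with made-up data instead of the live URL (the real call would pass `source`; newer pandas versions prefer a file-like object, hence the StringIO wrapper):

```python
from io import StringIO

import pandas as pd

# Stand-in for the fetched page; the real call would be pd.read_html(source)
html = """
<table>
  <tr><th>Name</th><th>State</th></tr>
  <tr><td>Jane Doe</td><td>NY</td></tr>
</table>
"""

tables = pd.read_html(StringIO(html))  # returns a *list* of DataFrames
print(len(tables))                     # how many tables were found on the page

# Write the first (here, only) table, appending rather than overwriting
tables[0].to_csv("results.csv", header=False, index=False, mode="a")
```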
