
I'm trying to scrape data from a table on a website, but I keep running into "ValueError: cannot set a row with mismatched columns".

The set-up is:

import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://kr.youtubers.me/united-states/all/top-500-youtube-channels-in-united-states/en'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'lxml')
table1 = soup.find('div', id='content')

headers = []
for i in table1.find_all('th'):
    title = i.text
    headers.append(title)

my_data = pd.DataFrame(columns = headers)
my_data = my_data.iloc[:,:-4]

Here, I was able to make an empty dataframe with the same headers as the table (I used iloc because there were some repeated columns at the end).

Now, I wanted to fill in the empty dataframe through:

for j in table1.find_all('tr')[1:]:
    row_data = j.find_all('td')
    row = [i.text for i in row_data]
    length = len(my_data)
    my_data.loc[length] = row

However, as mentioned, I get "ValueError: cannot set a row with mismatched columns" on the line my_data.loc[length] = row. I would really appreciate any help solving this problem and filling in the empty dataframe.
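For context, that error is raised whenever a row has a different number of values than the DataFrame has columns; here that is likely because the headers were trimmed with iloc[:,:-4] while each tr still yields its full set of td cells. A minimal, self-contained sketch of a guard that skips mismatched rows, using a toy HTML snippet rather than the live page:

```python
import pandas as pd
from bs4 import BeautifulSoup

# Toy HTML standing in for the scraped page (the real page's markup may differ)
html = """
<table>
  <tr><th>rank</th><th>Youtuber</th><th>subscribers</th></tr>
  <tr><td>1</td><td>Foo</td><td>100</td></tr>
  <tr><td>2</td><td>Bar</td></tr>
  <tr><td>3</td><td>Baz</td><td>300</td></tr>
</table>
"""
soup = BeautifulSoup(html, 'html.parser')

headers = [th.text for th in soup.find_all('th')]
df = pd.DataFrame(columns=headers)

for tr in soup.find_all('tr')[1:]:
    row = [td.text for td in tr.find_all('td')]
    if len(row) != len(df.columns):  # skip rows whose cell count doesn't match the headers
        continue
    df.loc[len(df)] = row
```

The second row (only two cells) is skipped instead of raising the ValueError; if you instead trim columns from the headers, you must trim the same positions from each row before assigning it.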

Thanks in advance.

2 Answers


Rather than trying to fill an empty DataFrame, it is simpler to use pd.read_html, which parses every table tag in the HTML and returns a list of DataFrames.

Even though this page has only two tables ("Top Youtube channels" and "Top Youtube channels - detail stats"), three DataFrames are returned because the second table is split into two table tags between rows 12 and 13 for some reason; but they can all be combined into one DataFrame.

dfList = pd.read_html(url) # OR
# dfList = pd.read_html(page.text) # OR
# dfList = pd.read_html(soup.prettify())

allTime = dfList[0].set_index(['rank', 'Youtuber'])

# (header row in 1st half so 2nd half reads as headerless to pandas)
dfList[2].columns = dfList[1].columns 
perYear = pd.concat(dfList[1:]).set_index(['rank', 'Youtuber'])


columns_ordered = [
    'started', 'category', 'subscribers', 'subscribers/year', 
    'video views', 'Video views/Year', 'video count', 'Video count/Year'
] # re-order columns as preferred
combinedDf = pd.concat([allTime, perYear], axis='columns')[columns_ordered]

If the [columns_ordered] part is omitted from the last line, then the expected column order would be 'subscribers', 'video views', 'video count', 'category', 'started', 'subscribers/year', 'Video views/Year', 'Video count/Year'.
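That ordering follows from how pd.concat with axis='columns' simply appends the second frame's columns after the first's. A toy sketch (the column names and values here are illustrative, not the real scraped data):

```python
import pandas as pd

# Stand-ins for allTime and perYear, sharing the same index
allTime = pd.DataFrame(
    {'subscribers': [100, 200], 'video views': [1000, 2000]},
    index=['A', 'B'],
)
perYear = pd.DataFrame(
    {'started': [2015, 2016], 'subscribers/year': [10, 20]},
    index=['A', 'B'],
)

combined = pd.concat([allTime, perYear], axis='columns')
print(list(combined.columns))  # first frame's columns, then the second's

# Selecting with a column list re-orders as preferred
reordered = combined[['started', 'subscribers', 'subscribers/year', 'video views']]
print(list(reordered.columns))
```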

combinedDf then holds both the all-time and per-year stats in one table, indexed by rank and Youtuber.




You can use pd.read_html to read the table directly into a DataFrame:

import pandas as pd

url = "https://kr.youtubers.me/united-states/all/top-500-youtube-channels-in-united-states/en"

df = pd.read_html(url)[0]
print(df)

Prints:

     rank                                            Youtuber  subscribers  video views  video count              category  started
0       1                                   ✿ Kids Diana Show    106000000  86400421379         1052        People & Blogs     2015
1       2                                          Movieclips     58500000  59672883333        39903      Film & Animation     2006
2       3                                        Ryan's World     34100000  53568277882         2290         Entertainment     2015
3       4                                     Toys and Colors     38300000  44050683425          901         Entertainment     2016
4       5   LooLoo Kids - Nursery Rhymes and Children's Songs     52200000  30758617681          605                 Music     2014
5       6                                            LankyBox     22500000  30147589773         6913                Comedy     2016
6       7                                          D Billions     24200000  27485780190          582                   NaN     2019
7       8                   BabyBus - Kids Songs and Cartoons     31200000  25202247059         1946             Education     2016
8       9                                              FGTeeV     21500000  23255537029         1659                Gaming     2013

...and so on.

