
Here's the link for scraping: http://5000best.com/websites/Games/

I've tried almost everything I can. I'm a beginner at web scraping.

My code:

from urllib.request import urlopen
from urllib.error import HTTPError, URLError
from bs4 import BeautifulSoup
import csv


try:
    html = urlopen("http://5000best.com/websites/Games/")
except HTTPError as e:
    print(e)
except URLError as u:
    print(u)
else:
    soup = BeautifulSoup(html, "html.parser")
    # the ranking table sits inside the div with id="content"
    table = soup.find_all("div", {"id": "content"})[0]
    rows = table.find_all("tr")
    with open("games.csv", "w", newline="", encoding="utf-8") as csv_file:
        writer = csv.writer(csv_file)
        for row in rows:
            # header cells (th) and data cells (td) go on the same CSV line
            th_data = [th.text.strip("\n") for th in row.find_all("th")]
            td_data = [td.text.replace("\n", "") for td in row.find_all("td")]
            writer.writerow(th_data + td_data)

This code only scrapes the first page of the table... I want all the pages. I inspected the web page but I didn't see any URL changes while toggling the page numbers, so it's completely dynamic.

3 Answers


You can read it directly using the pandas.read_html() function, which parses every HTML table on the page into a DataFrame. The [1] index below picks the second table read_html() finds on the page, which is the ranking table, and looping over the page numbers in the URL handles the pagination for you.

import pandas as pd


def main(url):
    for item in range(1, 4):  # pages 1-3; widen the range to cover more pages
        df = pd.read_html(url.format(item))[1]  # [1] = the ranking table
        print(df)


main("http://5000best.com/websites/Games/{}/")

Sample of output: (screenshot of the printed DataFrames omitted)

Edit - saving each page to a separate CSV file:

import pandas as pd


def main(url):
    for item in range(1, 4):
        df = pd.read_html(url.format(item))[1]
        print(f"Saving Page {item}")
        df.to_csv(f"page{item}.csv", index=False)


main("http://5000best.com/websites/Games/{}/")

Code updated to combine the pages into a single DataFrame:

import pandas as pd


def main(url):
    goal = []
    for item in range(1, 4):
        df = pd.read_html(url.format(item))[1]
        goal.append(df)
    final = pd.concat(goal)
    print(final)


main("http://5000best.com/websites/Games/{}/")

10 Comments

It's an easy way, though... but it will scrape only the first page of the table. As the table is paginated, I'm looking for a way to scrape all of the paginated data.
@HemantSah It seems you didn't try to run the code yet. Be informed that it will paginate all the tables; that's why I made a loop!
@αԋɱҽԃ αмєяιcαη The more I see your solution, the more fascinated I am.
@HumayunAhmadRajib glad to help :)
@HemantSah I've updated the answer for you to save the tables to CSV files.

Looking at the network inspector for that page reveals the requests it makes when you change pages. You may want to just scrape those URLs instead.

1 Comment

Is there any way to store all these links in a list or dictionary automatically, without inspecting every table?

Let me try to help you understand.

Have you used the developer tools in your browser? Open them (press F12, or right click > Inspect Element) and select the Network tab. Now, while keeping the tab open, click on the next-page link. A request shows up in the Network tab.

This is what you are looking for. Everything dynamic on a web page can be viewed here.

Hope this helps you learn something. Cheers!
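
Once you've spotted the per-page request in the Network tab, you can replay it in code. A minimal sketch, assuming the request URL follows the /Games/{page}/ pattern shown in the first answer (confirm the exact URL in the Network tab yourself):

import requests
from bs4 import BeautifulSoup

# assumed per-page URL, inferred from the first answer's pattern
resp = requests.get("http://5000best.com/websites/Games/2/")
resp.raise_for_status()
soup = BeautifulSoup(resp.text, "html.parser")
content = soup.find_all("div", {"id": "content"})[0]
print(len(content.find_all("tr")), "table rows on page 2")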

3 Comments

Is there any way to automate this and store all these links in a list or dictionary? I want to scrape the table for all categories, i.e. Games, Commerce, Music, etc. Or do I have to collect them manually?
You can run two loops: one over the categories, and inside it an iterative loop over the pages. If the page you hit has no links at all, you may break the inner loop, as this means you have reached the end of that category (see the sketch after this thread).
I didn't get it... I want to make a list of all these dynamic links from every category. Is this possible, or do I have to collect all these links by visiting every table myself?
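
A minimal sketch of that two-loop idea, building the dictionary of links the commenter asked for. The category list and the /{category}/{page}/ URL pattern for categories other than Games are assumptions, and the stop conditions are the same guesses as in the first answer:

import pandas as pd
from urllib.error import HTTPError

# illustrative category names; the real list would come from the site's menu
categories = ["Games", "Commerce", "Music"]


def page_urls(category):
    urls = []
    page = 1
    while True:
        url = f"http://5000best.com/websites/{category}/{page}/"
        try:
            df = pd.read_html(url)[1]  # assumed: ranking table is the second table
        except (ValueError, IndexError, HTTPError):
            break  # request failed or no table found: end of this category
        if df.empty:
            break
        urls.append(url)
        page += 1
    return urls


links = {cat: page_urls(cat) for cat in categories}
print(links)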
