How to extract a table from a website using Python

Question

I have been trying to extract a table from a website, but I am lost. Can anyone help me? My goal is to extract the table on the Scope page: https://training.gov.au/Organisation/Details/31102

import requests
from bs4 import BeautifulSoup
url = "https://training.gov.au/Organisation/Details/31102"
response = requests.get(url)
page = response.text
soup = BeautifulSoup(page, 'lxml')

table = soup.find(id="ScopeQualification")
[row.text.split() for row in table.find_all("tr")]

This web page loads its data with JavaScript, so the table is not in the HTML that requests receives. Driving a real browser with Selenium is one way to get this info: pypi.org/project/selenium
– Paul Brennan, Commented Jan 8, 2021 at 2:37

Ferris · Accepted Answer · 2021-01-13 05:40:33Z

First, find the OrganisationId in the source of 'https://training.gov.au/Organisation/Details/31102'.
Then find the XHR URL that loads the table, https://training.gov.au/Organisation/AjaxScopeQualification/3fbfd9c9-3cce-4d69-973e-4e2674f8c5a9?tabIndex=4, and request it with the POST method.

import json
import re

import pandas as pd
import requests


def get_organisationId(url):
    # The Details page embeds the organisation GUID as "OrganisationId=<guid>&".
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.67 Safari/537.36'}
    resp = requests.get(url, headers=headers)
    id_list = re.findall(r'OrganisationId=(.*?)&', resp.text)
    organisationId = id_list[0] if id_list else None
    return organisationId


def get_AjaxScopeQualification(organisationId):
    # Returns the parsed scope payload, or None when no organisationId was found.
    if not organisationId:
        return None
    url = f'https://training.gov.au/Organisation/AjaxScopeQualification/{organisationId}?tabIndex=4'
    headers = {
        'origin': 'https://training.gov.au',
        'referer': f'https://training.gov.au/Organisation/Details/{organisationId}?tabIndex=4',
        'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.67 Safari/537.36',
        'x-requested-with': 'XMLHttpRequest'
    }
    data = {'page': '1', 'size': '100', 'orderBy': 'Code-asc', 'groupBy': '', 'filter': ''}
    r = requests.post(url, json=data, headers=headers)
    # The body contains JavaScript "new Date(y,m,d,0,0,0)" calls, which are not valid JSON,
    # so rewrite each one as a quoted "y-m-d" string before parsing.
    # The month is kept as emitted; JavaScript counts months from 0.
    response = json.loads(re.sub(r'new Date\((\d+),(\d+),(\d+),0,0,0\)', r'"\1-\2-\3"', r.text))
    return response


# get organisationId first
url = 'https://training.gov.au/Organisation/Details/31102'
organisationId = get_organisationId(url)

# then fetch the scope rows and load them into a DataFrame
response = get_AjaxScopeQualification(organisationId)
dfn = pd.json_normalize(response, 'data', meta=['total'])
print(dfn.columns)
print(dfn[['Code', 'Title', 'Extent']])
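
The new Date(...) substitution in the code above can also be written with a replacement function, which makes it easier to produce ISO-style dates. The following is only a sketch: the helper names and the sample payload are made up for illustration, and it assumes the payload follows the JavaScript convention that months count from 0, so the month is shifted by one.

```python
import json
import re

# Matches JavaScript constructor calls such as: new Date(2015,2,3,0,0,0)
JS_DATE = re.compile(r'new Date\((\d+),(\d+),(\d+),0,0,0\)')


def js_date_to_iso(match):
    """Turn one new Date(y,m,d,0,0,0) match into a quoted YYYY-MM-DD string."""
    year, month, day = (int(g) for g in match.groups())
    # Assumption: months are zero-based, as in JavaScript's Date constructor.
    return f'"{year:04d}-{month + 1:02d}-{day:02d}"'


def parse_scope_payload(text):
    """Replace every new Date(...) call with an ISO date, then parse as JSON."""
    return json.loads(JS_DATE.sub(js_date_to_iso, text))


# Hypothetical payload fragment, shaped like the response described above.
sample = '{"data": [{"Code": "X", "StartDate": new Date(2015,2,3,0,0,0)}], "total": 1}'
parsed = parse_scope_payload(sample)
print(parsed["data"][0]["StartDate"])
```

If the site turns out to emit one-based months, drop the "+ 1" in js_date_to_iso.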

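Result (the StartDate and EndDate values below come from a run in which the date substitution repeated the month in place of the day, so their day part is not reliable):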

response['data'][0]

{'Id': '5096634d-4210-4fd4-a51d-f548cd39d57b',
 'NrtId': '2feb7e3f-7fc6-4719-ba66-2a066f6635c7',
 'RtoId': '3fbfd9c9-3cce-4d69-973e-4e2674f8c5a9',
 'TrainingComponentType': 2,
 'Code': 'BSB20115',
 'Title': 'Certificate II in Business',
 'IsImplicit': False,
 'ExtentId': '01',
 'Extent': 'Deliver and assess',
 'StartDate': '2015-3-3',
 'EndDate': '2022-3-3',
 'DeliveryNsw': True,
 'DeliveryVic': True,
 'DeliveryQld': True,
 'DeliverySa': True,
 'DeliveryWa': True,
 'DeliveryTas': True,
 'DeliveryNt': True,
 'DeliveryAct': True,
 'ScopeDecisionType': 0,
 'ScopeDecision': 'Deliver and assess',
 'OverseasCodeAlpha': None,
 'OverseasCodeAlhpaList': [],
 'OverseasCodeAlphaOutput': ''}

edited Jan 13, 2021 at 5:40

answered Jan 8, 2021 at 3:58


10 Comments

AYUSH NEPAL Over a year ago

Hi! But this is not printing the table.

AYUSH NEPAL Over a year ago

I printed df, but it prints in a different format.

AYUSH NEPAL Over a year ago

It prints like this: 5096634d-4210-4fd4-a51d-f548cd39d57b ... 19. It does not print the table with the course name, extent and code.

Ferris Over a year ago

Try df.iloc[0] or df.to_excel('file.xlsx'). print(df) only shows a shortened view of a wide DataFrame; the data itself is all there.

Ferris Over a year ago

Or modify the function get_AjaxScopeQualification: replace return dfn with return response.
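
For the display problem discussed in these comments, one option is to select the wanted columns and render them with to_string(), which does not shorten the output with "...". This is a sketch only: the single record below is copied from the result shown in the answer (trimmed to four fields), and the variable names are illustrative.

```python
import pandas as pd

# One record shaped like response['data'][0] from the answer, trimmed to four fields.
records = [
    {
        "Id": "5096634d-4210-4fd4-a51d-f548cd39d57b",
        "Code": "BSB20115",
        "Title": "Certificate II in Business",
        "Extent": "Deliver and assess",
    },
]

df = pd.DataFrame(records)

# Keep only the columns of interest.
table = df[["Code", "Title", "Extent"]]

# to_string() renders every selected column in full.
text = table.to_string(index=False)
print(text)

# to_csv() with no path returns the CSV text; pass a filename to write a file instead.
csv_text = table.to_csv(index=False)
```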

Ferris · Accepted Answer · 2021-01-13 06:23:21Z

To handle the search page -> https://training.gov.au/Search/SearchOrganisation?Name=&IncludeUnregisteredRtos=false&IncludeNotRtos=false&orgSearchByNameSubmit=Search&AdvancedSearch=&JavaScriptEnabled=true

Its Ajax link is -> https://training.gov.au/Search/AjaxGetOrganisations?implicitNrtScope=True&includeUnregisteredRtosForScopeSearch=True&includeUnregisteredRtos=False&includeNotRtos=False&orgSearchByNameSubmit=Search&JavaScriptEnabled=true

Use the Ajax link with the POST method to get the JSON data.

Change 'size': '200' to control how many rows the response returns.

import requests

# Ajax endpoint behind the organisation search page
url = 'https://training.gov.au/Search/AjaxGetOrganisations?implicitNrtScope=True&includeUnregisteredRtosForScopeSearch=True&includeUnregisteredRtos=False&includeNotRtos=False&orgSearchByNameSubmit=Search&JavaScriptEnabled=true'
headers = {
    'origin': 'https://training.gov.au',
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.67 Safari/537.36',
    'x-requested-with': 'XMLHttpRequest'
}
# 'size' sets how many rows come back per page
data = {'page': '1', 'size': '200', 'orderBy': 'LegalPersonName-asc', 'groupBy': '', 'filter': ''}
r = requests.post(url, json=data, headers=headers)
response = r.json()
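
Since 'page' and 'size' are sent with the request, the remaining rows can presumably be collected by increasing 'page' until everything has been read. Below is a rough sketch of that loop. It assumes the search response carries a 'data' list and a 'total' count like the scope response does, which I have not confirmed; fetch_all, fake_fetch and max_pages are names invented here. The fetch_page argument is any callable taking (page, size), for example a small wrapper around the requests.post call above.

```python
def fetch_all(fetch_page, size=200, max_pages=1000):
    """Collect rows page by page until 'total' rows are read or a page is empty."""
    rows = []
    for page in range(1, max_pages + 1):
        payload = fetch_page(page, size)
        batch = payload.get('data', [])
        rows.extend(batch)
        # Assumption: 'total' is the overall row count; fall back to what we have.
        total = payload.get('total', len(rows))
        if not batch or len(rows) >= total:
            break
    return rows


def fake_fetch(page, size):
    """Stand-in for the real POST request, serving five made-up rows."""
    items = [{'OrganisationId': f'id-{i}'} for i in range(5)]
    start = (page - 1) * size
    return {'data': items[start:start + size], 'total': len(items)}


rows = fetch_all(fake_fetch, size=2)
print(len(rows))
```

The max_pages cap is just a guard against looping forever if 'total' is missing or wrong.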

Result:

From the search result you can take ea38f597-077e-4c57-b7b6-7ca7dde88399 directly as the OrganisationId. There is no need to use 'Codes': '6639' to parse https://training.gov.au/Organisation/Details/6639 for the OrganisationId.

'Codes': '6639',
https://training.gov.au/Organisation/Details/6639
https://training.gov.au/Organisation/AjaxScopeSkillSet/ea38f597-077e-4c57-b7b6-7ca7dde88399?includeImplicit=True&tabIndex=4&_=1610518795452

response['data'][0]

{'OrganisationId': 'ea38f597-077e-4c57-b7b6-7ca7dde88399',
 'IsRto': True,
 'IsTpd': False,
 'Codes': '6639',
 'LegalPersonName': '1 EDUCATION PTY LTD',
 'LegalPersonNameNonCurrent': 'Brad Fenby and Associates Pty Ltd, Franklyn Scholar (Victoria) Pty Ltd',
 'TradingNames': [],
 'WebAddresses': ['http://www.1education.com.au'],
 'GeneralEnquiriesPhone': '0478752453',
 'RegistrationStatus': None,
 'ValidationType': 0,
 'RtoStatus': 0,
 'StatusString': 'Current',
 'RegistrationManagerId': '12',
 'RegistrationStartDate': '/Date(1554037200000+1100)/',
 'RegistrationEndDate': '/Date(1774789200000+1100)/',
 'CreatedDate': '/Date(1307654398430+1000)/',
 'ExternalLinks': {'ExternalLinkType': 2,
  'Description': 'MySkillsRto',
  'Url': 'http://www.myskills.gov.au/RegisteredTrainers/Details?rtocode={0}'},
 'RtoType': '91',
 'ActiveScopeAct': True,
 'ActiveScopeNsw': True,
 'ActiveScopeVic': True,
 'ActiveScopeQld': True,
 'ActiveScopeSA': True,
 'ActiveScopeNT': True,
 'ActiveScopeWA': True,
 'ActiveScopeTas': True,
 'ActiveScopeInt': True,
 'RegistrationManagerShortName': 'ASQA',
 'StatusSortOrder': '4',
 'MySkillsLink': 'http://www.myskills.gov.au/RegisteredTrainers/Details?rtocode=6639'}
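
The RegistrationStartDate, RegistrationEndDate and CreatedDate values above look like the legacy ASP.NET JSON date format, that is, milliseconds since the Unix epoch followed by a UTC offset. Assuming that reading is right, they could be converted to datetime objects roughly as follows. The function name parse_ms_date is made up for this sketch.

```python
import re
from datetime import datetime, timedelta, timezone

# Matches strings such as /Date(1554037200000+1100)/ ; the offset part is optional.
MS_DATE = re.compile(r'/Date\((-?\d+)([+-]\d{4})?\)/')
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def parse_ms_date(value):
    """Convert '/Date(<ms>[+-hhmm])/' to a timezone-aware datetime."""
    match = MS_DATE.fullmatch(value)
    if match is None:
        raise ValueError(f'unrecognised date string: {value!r}')
    millis = int(match.group(1))
    offset = match.group(2)
    tz = timezone.utc
    if offset:
        sign = 1 if offset[0] == '+' else -1
        tz = timezone(sign * timedelta(hours=int(offset[1:3]), minutes=int(offset[3:5])))
    # The millisecond count is taken as UTC; the offset only changes the display zone.
    return (EPOCH + timedelta(milliseconds=millis)).astimezone(tz)


start = parse_ms_date('/Date(1554037200000+1100)/')
print(start.isoformat())
```

For the RegistrationStartDate shown above this gives midnight on 1 April 2019 at UTC+11.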

edited Jan 13, 2021 at 6:23

answered Jan 13, 2021 at 6:15


8 Comments

AYUSH NEPAL Over a year ago

I was just asking how to use the POST method as you said. So do I need to extract from this output? I wanted the code, title and extent.

Ferris Over a year ago

You are right, use the JSON result, as it contains more information. You also don't need to parse 2 links per Codes value; just use the OrganisationId from the result.

AYUSH NEPAL Over a year ago

Okay, I will try and let you know how it goes, but I am very confused as there are 4k links.

Ferris Over a year ago

What I mean is to use a for loop to handle the 4k links, as in studytonight.com/python/web-scraping/scraping-multiple-urls.

Ferris Over a year ago

In this case, that means iterating over the list of OrganisationId values.
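
A loop over the OrganisationId list might look something like the sketch below. The names collect_scopes and fake_scope are invented for illustration; in real use fetch_scope would be the get_AjaxScopeQualification function from the first answer. The pause between requests and the broad exception handling are my own choices, not something the site documents.

```python
import time


def collect_scopes(organisation_ids, fetch_scope, delay=1.0):
    """Call fetch_scope for each id and return {organisation_id: result or None}."""
    results = {}
    for org_id in organisation_ids:
        try:
            results[org_id] = fetch_scope(org_id)
        except Exception as exc:  # keep going if one organisation fails
            print(f'failed for {org_id}: {exc}')
            results[org_id] = None
        # Pause between requests so 4k calls do not hammer the server.
        time.sleep(delay)
    return results


def fake_scope(org_id):
    """Stand-in for the real request; fails for one made-up id."""
    if org_id == 'bad-id':
        raise RuntimeError('no such organisation')
    return {'data': [{'Code': 'BSB20115'}], 'total': 1}


scopes = collect_scopes(['ea38f597-077e-4c57-b7b6-7ca7dde88399', 'bad-id'], fake_scope, delay=0)
print(list(scopes))
```

With the real function the call would be collect_scopes(ids, get_AjaxScopeQualification), where ids is the list of OrganisationId values taken from the search result.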
