I am scraping different courses from university sites.

The HTML of the portion of the site is:

<div>
<h2>About the programme</h2>
<p>The National&nbsp;Joint&nbsp;PhD Programme in Nautical Operations&nbsp;is organised as a joint degree between the following four national higher education institutions offering professional maritime education:</p>
<ul>
    <li>Universtity of Troms&oslash; - The Arctic University of Norway (UiT)</li>
    <li>University of&nbsp;South-Eastern&nbsp;Norway (USN)</li>
    <li>Western Norway University of Applied Sciences (HVL)</li>
    <li>Norwegian University of Science and Technology (NTNU)</li>
</ul>
<p>
    The National&nbsp;Joint&nbsp;PhD Programme in Nautical Operations will educate qualified candidates for research, teaching, dissemination and innovation work, and other activities requiring scientific insight and an operational
    maritime focus.&nbsp;
</p>
<p>
    Implementation of complex nautical operations today requires interdisciplinarity and differentiated competence, including research expertise, for the safe and efficient planning, implementation and evaluation of nautical
    operations.&nbsp;
</p>
<p>The programme has the following&nbsp;vision: to create an internationally recognized national PhD degree in nautical operations.</p>
<p>This vision will be achieved through the following overall objectives:</p>
<ol>
    <li>Strengthen the multidisciplinary national expertise in nautical operations through collaboration between the four higher education institutions in Norway with professional maritime education.</li>
    <li>The PhD Programme in Nautical Operations is the preferred Programme in the field and attracts good applicants nationally and internationally from major maritime nations.</li>
    <li>Individuals graduating from the Programme are in demand both nationally and internationally because they have a strong and relevant research-based expertise and the ability to innovate and adapt.</li>
    <li>Increase value creation and innovation through close cooperation between academia, maritime industry and public sector.</li>
    <li>The multidisciplinary national competence related to nautical operations constitutes an internationally recognised professional environment that sets the terms for the development of knowledge in the field.</li>
</ol>
<h2>Academic content</h2>
<p>Nautical operations consist of two subject areas:</p>
<ul>
    <li>
        Nautical studies&nbsp;that include navigation, maneuvering and transport of floating craft, and operations, indicating that the PhD program will focus on applied research to support, improve and develop the activities
        undertaken.
    </li>
    <li>
        The operational perspective&nbsp;includes strategic, tactical and operational aspects.&nbsp;Strategic levels include the choice of type and size of a ship fleet.&nbsp;Tactical aspects concern the design of individual ships and
        the selection of equipment and staff.&nbsp;The operational aspects include planning, implementation and evaluation of nautical operations.
    </li>
</ul>
<p>There is a compulsory&nbsp;joint maritime course offered at all the four institutions.</p>
</div>

Link to the site: https://www.usn.no/english/research/postgraduate-studies-phd/our-phd-programmes/nautical-operations/

I am trying to get the text for course_description / about_the_course and for academic_content separately, i.e. the text that follows each of the two 'h2' tags above. I am completely stuck: how can I write generalized code that collects the text under each h2 tag?

Also, I don't think indexing will help, as the order of the <p> and <li> tags varies from course to course.

3 Answers

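You can find the h2 by its text and call .get_text() with separator='\n' on its parent. Note that this prints the whole parent block, so both sections come out together: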

import requests
from bs4 import BeautifulSoup


url = 'https://www.usn.no/english/research/postgraduate-studies-phd/our-phd-programmes/nautical-operations/'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

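# string= replaces the deprecated text= argument; the "t and" guard skips
# any <h2> whose .string is None (for example one that contains nested tags).
desc = soup.find('h2', string=lambda t: t and 'About the programme' in t)
print(desc.parent.get_text(strip=True, separator='\n'))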

Prints:

About the programme
The National Joint PhD Programme in Nautical Operations is organised as a joint degree between the following four national higher education institutions offering professional maritime education:
Universtity of Tromsø
- The Arctic University of Norway (UiT)
University of South-Eastern Norway (USN)
Western Norway University of Applied Sciences
(HVL)
Norwegian University of Science and Technology
(NTNU)
The National Joint PhD Programme in Nautical Operations will educate qualified candidates for research, teaching, dissemination and innovation work, and other activities requiring scientific insight and an operational maritime focus.
Implementation of complex nautical operations today requires interdisciplinarity and differentiated competence, including research expertise, for the safe and efficient planning, implementation and evaluation of nautical operations.
The programme has the following vision: to create an internationally recognized national PhD degree in nautical operations.
This vision will be achieved through the following overall objectives:
Strengthen the multidisciplinary national expertise in nautical operations through collaboration between the four higher education institutions in Norway with professional maritime education.
The PhD Programme in Nautical Operations is the preferred Programme in the field and attracts good applicants nationally and internationally from major maritime nations.
Individuals graduating from the Programme are in demand both nationally and internationally because they have a strong and relevant research-based expertise and the ability to innovate and adapt.
Increase value creation and innovation through close cooperation between academia, maritime industry and public sector.
The multidisciplinary national competence related to nautical operations constitutes an internationally recognised professional environment that sets the terms for the development of knowledge in the field.
Academic content
Nautical operations consist of two subject areas:
Nautical studies that include navigation, maneuvering and transport of floating craft, and operations, indicating that the PhD program will focus on applied research to support, improve and develop the activities undertaken.
The operational perspective includes strategic, tactical and operational aspects. Strategic levels include the choice of type and size of a ship fleet. Tactical aspects concern the design of individual ships and the selection of equipment and staff. The operational aspects include planning, implementation and evaluation of nautical operations.
There is a compulsory joint maritime course offered at all the four institutions.
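
The call above prints the whole container, so both sections come out together. To get them separately, one option is to walk the siblings that follow each h2 and stop at the next h2. This is only a minimal sketch, assuming the headings and their content are direct siblings inside one container, as in the markup from the question. SAMPLE_HTML is a trimmed copy of that markup so the example runs offline, and sections_by_heading is a hypothetical helper name, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Trimmed copy of the markup from the question, so the sketch runs offline.
SAMPLE_HTML = """
<div>
<h2>About the programme</h2>
<p>The programme has the following vision: to create an internationally recognized national PhD degree in nautical operations.</p>
<ul>
    <li>University of South-Eastern Norway (USN)</li>
    <li>Norwegian University of Science and Technology (NTNU)</li>
</ul>
<h2>Academic content</h2>
<p>Nautical operations consist of two subject areas:</p>
<ul>
    <li>Nautical studies</li>
    <li>The operational perspective</li>
</ul>
<p>There is a compulsory joint maritime course offered at all the four institutions.</p>
</div>
"""


def sections_by_heading(container, heading="h2"):
    """Map each heading's text to the text of the sibling tags that follow it.

    Hypothetical helper: assumes headings and content are siblings.
    """
    sections = {}
    for h in container.find_all(heading):
        lines = []
        for sibling in h.next_siblings:
            if not isinstance(sibling, Tag):
                continue  # skip bare strings (whitespace) between tags
            if sibling.name == heading:
                break  # the next section starts here
            # One line per direct <li> child; otherwise one line per tag.
            items = sibling.find_all("li", recursive=False) or [sibling]
            lines.extend(item.get_text(" ", strip=True) for item in items)
        sections[h.get_text(strip=True)] = "\n".join(line for line in lines if line)
    return sections


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
sections = sections_by_heading(soup.find("div"))
about_the_course = sections["About the programme"]
academic_content = sections["Academic content"]
print(academic_content)
```

Because the loop only looks at whatever tags sit between two headings, it does not depend on the order or number of p, ul and ol tags. For the live page you would pass the div found with soup.find(...) instead of the sample.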

It is actually very simple. Just identify the div tag and print the text within it. Here is the full code to do it:

from bs4 import BeautifulSoup
import requests

r = requests.get('https://www.usn.no/english/research/postgraduate-studies-phd/our-phd-programmes/nautical-operations/').text

soup = BeautifulSoup(r,'html5lib')

div_tag = soup.find('div',class_ = "articleelement newtext contentAbove")

print(div_tag.text)

Output:

About the programme
The National Joint PhD Programme in Nautical Operations is organised as a joint degree between the following four national higher education institutions offering professional maritime education:
    Universtity of Tromsø - The Arctic University of Norway (UiT)
    University of South-Eastern Norway (USN)
    Western Norway University of Applied Sciences (HVL)
    Norwegian University of Science and Technology (NTNU)
The National Joint PhD Programme in Nautical Operations will educate qualified candidates for research, teaching, dissemination and innovation work, and other activities requiring scientific insight and an operational maritime focus. 
Implementation of complex nautical operations today requires interdisciplinarity and differentiated competence, including research expertise, for the safe and efficient planning, implementation and evaluation of nautical operations. 
The programme has the following vision: to create an internationally recognized national PhD degree in nautical operations.
This vision will be achieved through the following overall objectives:
    Strengthen the multidisciplinary national expertise in nautical operations through collaboration between the four higher education institutions in Norway with professional maritime education.
    The PhD Programme in Nautical Operations is the preferred Programme in the field and attracts good applicants nationally and internationally from major maritime nations.
    Individuals graduating from the Programme are in demand both nationally and internationally because they have a strong and relevant research-based expertise and the ability to innovate and adapt.
    Increase value creation and innovation through close cooperation between academia, maritime industry and public sector.
    The multidisciplinary national competence related to nautical operations constitutes an internationally recognised professional environment that sets the terms for the development of knowledge in the field.
Academic content
Nautical operations consist of two subject areas:
    Nautical studies that include navigation, maneuvering and transport of floating craft, and operations, indicating that the PhD program will focus on applied research to support, improve and develop the activities undertaken.
    The operational perspective includes strategic, tactical and operational aspects. Strategic levels include the choice of type and size of a ship fleet. Tactical aspects concern the design of individual ships and the selection of equipment and staff. The operational aspects include planning, implementation and evaluation of nautical operations.
There is a compulsory joint maritime course offered at all the four institutions.

That gets all of the text. If you just want the headings, here is the complete code:

from bs4 import BeautifulSoup
import requests

r = requests.get('https://www.usn.no/english/research/postgraduate-studies-phd/our-phd-programmes/nautical-operations/').text

soup = BeautifulSoup(r,'html5lib')

div_tag = soup.find('div',class_ = "articleelement newtext contentAbove")

headings = div_tag.find_all('h2')

for heading in headings:
    print(heading.text)

Output:

About the programme
Academic content

Hope that this helps!

3 Comments

Your code would work flawlessly if I had to get all the text. But here, I am trying to get "About the programme" and "Academic content" separately.
Are you trying to get the headings?
Check out my latest edit. I have updated ways to get both the text and the headings.

You can try this with Selenium:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

PATH = "./chromedriver"

# Selenium 4 takes the driver path through a Service object
driver = webdriver.Chrome(service=Service(PATH))
driver.implicitly_wait(5)

url = "https://www.usn.no/english/research/postgraduate-studies-phd/our-phd-programmes/nautical-operations/"
driver.get(url)

# find_element_by_xpath was removed in Selenium 4; use find_element(By.XPATH, ...)
path = "//div[@class='articleelement newtext contentAbove']//h2[contains(text(), 'About the programme')]/following-sibling::p"
about_the_program = driver.find_element(By.XPATH, path)

path = "//div[@class='articleelement newtext contentAbove']//h2[contains(text(), 'Academic content')]/following-sibling::p"
academic_content = driver.find_element(By.XPATH, path)

Here you find the h2 tag whose text contains About the programme (or Academic content) and then select the p tags that follow it as siblings. The path matches every later p sibling, including those under the next h2, and find_element returns only the first of them. If you want a sibling that is some other tag, you can specify that in the path.

EDIT 1

If you don't know which tag will come after the h2 tag, you can probably try this:

from selenium.webdriver.common.by import By

list_of_tags = ['p', 'ul', 'span']
base_path = "//div[@class='articleelement newtext contentAbove']//h2[contains(text(), 'About the programme')]/following-sibling::"

for tag in list_of_tags:
    try:
        # Raises NoSuchElementException when the h2 has no such sibling
        element_required = driver.find_element(By.XPATH, base_path + tag)
    except Exception as e:
        print(e)

This code builds the path for each tag in the list in turn. If a sibling with that tag exists inside the div, the code extracts it; otherwise it prints the error. Note that element_required is overwritten on each pass, so it ends up holding the last tag type that was found.
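
If the tag after the h2 is not known in advance, XPath can also test several element names in one predicate (self::p or self::ul), which avoids the loop. The sketch below is only an illustration under a few assumptions: it uses lxml instead of Selenium so it runs without a browser, the sample markup is trimmed from the question, and section_xpath is a hypothetical helper. The second predicate keeps only siblings that belong to the chosen section by comparing how many h2 tags precede each one.

```python
from lxml import html

# Trimmed copy of the markup from the question, so the sketch runs offline.
SAMPLE_HTML = """
<div>
<h2>About the programme</h2>
<p>This vision will be achieved through the following overall objectives:</p>
<ul>
    <li>University of South-Eastern Norway (USN)</li>
</ul>
<h2>Academic content</h2>
<p>Nautical operations consist of two subject areas:</p>
<ul>
    <li>Nautical studies</li>
    <li>The operational perspective</li>
</ul>
<p>There is a compulsory joint maritime course offered at all the four institutions.</p>
</div>
""".strip()


def section_xpath(title):
    """Build an XPath for the p/ul/ol siblings between one h2 and the next.

    Hypothetical helper; assumes `title` contains no single quote.
    """
    h2 = f"//h2[contains(text(), '{title}')]"
    return (
        # "or" inside the predicate accepts any of the listed tag names
        f"{h2}/following-sibling::*[self::p or self::ul or self::ol]"
        # keep only siblings preceded by exactly one more h2 than the target
        f"[count(preceding-sibling::h2) = count({h2}/preceding-sibling::h2) + 1]"
    )


tree = html.fromstring(SAMPLE_HTML)
about_elems = tree.xpath(section_xpath("About the programme"))
academic_elems = tree.xpath(section_xpath("Academic content"))

# Collapse runs of whitespace so each element becomes one line of text
academic_content = "\n".join(
    " ".join(e.text_content().split()) for e in academic_elems
)
print(academic_content)
```

The same expression string should also work in Selenium with driver.find_elements(By.XPATH, path), which returns a list of all matching elements rather than only the first.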

6 Comments

I didn't know about "following-sibling". Thank you so much. But how can I use 'p' or 'ul' / 'li' with following-sibling?
Check the path in the code. There, you have following-sibling::p. Change the p to whatever sibling you want to extract data from. Keep in mind that the sibling should be inside the same element as the h2.
Yes, I got that. What I meant was: can I use some kind of 'or' with following-sibling::?
You can probably use try-except.
I am not sure if that will be as straightforward. But I have edited the code with a solution for that issue. Not the best of solutions, but it should work.