Extract specific text from web-page using python

Question

The "Words" tab at the following URL shows the words that are available in the Arabic course I am following on Duolingo:

https://duome.eu/theahmedmustafa/progress

The words that I have already learned are in a bold-blue color and the rest in a normal font.

I want a method (preferably in Python or Java) to extract the words that I have already learned. I tried using Python Requests to fetch the source code of the page and work from there, but the source does not seem to contain any information that could be used to tell the learned words apart from the rest.

Any help would be appreciated!

Image: Snapshot of the page

srinivas-vaddi · Accepted Answer · 2020-05-09 17:59:58Z

As you rightly mentioned, this is "web scraping", and Python has excellent modules for it. The most obvious one is BeautifulSoup.

So, to get the information from your webpage:

You first need to understand the structure of the webpage.
Also, in some cases scraping might not be fully legal.
The bigger challenge is whether the webpage supports scraping at all.
- You can figure this out by looking at the source of the webpage.
- If the text you want to grab is visible in the source, or in one of the pages its hrefs point to, then it should be possible to scrape it using BeautifulSoup.

Solution -

Before you arrive at a solution, you must understand the HTML structure and the ways in which you can identify an element on a webpage.
There are many ways, such as:
- using the "id" of the element
- using the class or tag name directly
- using the XPath of the element
- or a combination of any or all of the above
Once you reach this point, it should be clear how to proceed.
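
To make the lookup styles listed above concrete, here is a minimal sketch that uses only the standard library. It is not BeautifulSoup: it swaps in xml.etree.ElementTree, which supports a limited XPath subset and only parses well-formed markup, so a real page would normally still need BeautifulSoup or a similar parser. The sample HTML, its ids, class names and words are all made up for illustration and are not taken from the actual page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup (NOT the real page structure).
SAMPLE = """
<html>
  <body>
    <h1 id="title">Vocabulary</h1>
    <ul id="words">
      <li class="learned"><b><a href="/w/1">word-one</a></b></li>
      <li class="new"><a href="/w/2">word-two</a></li>
      <li class="learned"><b><a href="/w/3">word-three</a></b></li>
    </ul>
  </body>
</html>
""".strip()

root = ET.fromstring(SAMPLE)

# 1. By id: any element whose id attribute equals "title"
title = root.find(".//*[@id='title']")

# 2. By class: every <li> whose class attribute is exactly "learned"
learned_items = root.findall(".//li[@class='learned']")

# 3. By tag name: every <a> anywhere in the document
all_links = root.findall(".//a")

# 4. By a path that combines id and tag names:
#    <a> inside <b> inside <li> inside the list with id "words"
bold_links = root.findall(".//ul[@id='words']/li/b/a")

print(title.text)                    # Vocabulary
print(len(learned_items))            # 2
print(len(all_links))                # 3
print([a.text for a in bold_links])  # ['word-one', 'word-three']
```

The last lookup is usually the most useful one in practice, because combining an id with a tag path narrows the match to exactly the elements you care about.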

import requests
from bs4 import BeautifulSoup

# Make a request to the webpage and grab the HTML response
page = requests.get("your url here")

# Pass the response body on to BeautifulSoup
soup = BeautifulSoup(page.content, 'html.parser')

# Depending on how you want to search, you can use find(), find_all(), select() and other methods
soup.find_all('your tag')

Andrej Kesely · Accepted Answer · 2020-05-09 17:57:08Z

This script should print all bold words from your page:

import re
import requests
from bs4 import BeautifulSoup

cookie_url = 'https://duome.eu/tz.php?time=GMT%202'
vocabulary_url = 'https://duome.eu/vocabulary/en/ar/{user_id}'
url = 'https://duome.eu/theahmedmustafa/progress'

with requests.session() as s:
    s.get(cookie_url).text  # load cookies
    html_data = s.get(url).text
    user_id = re.search(r'/vocabulary/en/ar/(\d+)', html_data).group(1)
    soup = BeautifulSoup(s.get(vocabulary_url.format(user_id=user_id)).text, 'html.parser')
    for a in soup.select('#words li > b > a'):
        print(a.text)

This prints:

أَرْوى
أَلْمانْيا
أَمريكا
أَمريكِيّ
أَمْريكِيّة
أَمْسْتِرْدام
أَنا
أَنْتَ
أَنْتِ
أَهْلاً
أَيْن
أُرْدُنِيّ
أُرْدُنِيّة
أُسْتاذ
أُسْتُرالْيا
إِسْكُتْلَنْدا
إِسْكُتْلَنْدِيّ
إِسْلامِيّة
إِنْجِليزِيّ
إِنْجِلْتِرا
امْرَأة
اِمْرَأة
باب
باريس

... and so on.
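
The script above pulls the numeric user ID out of the progress page with a regular expression before building the vocabulary URL. Here is a small sketch of just that step, run against a made-up HTML fragment (the ID 123456 and the surrounding markup are assumptions, not real data). The helper name extract_user_id is hypothetical, and it returns None when no link is found instead of raising an AttributeError on .group(1).

```python
import re

# Hypothetical fragment standing in for the progress page HTML; the ID is made up.
SAMPLE_HTML = '<a href="/vocabulary/en/ar/123456">Words</a>'


def extract_user_id(html):
    """Return the digits after /vocabulary/en/ar/ as a string, or None if absent."""
    match = re.search(r'/vocabulary/en/ar/(\d+)', html)
    return match.group(1) if match else None


print(extract_user_id(SAMPLE_HTML))            # 123456
print(extract_user_id('<p>no link here</p>'))  # None
```

Checking for None before formatting the vocabulary URL gives a clearer failure if the page layout ever changes.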
