1

The "Words" tab at the following URL shows the words available in the Arabic course I am following on Duolingo:

https://duome.eu/theahmedmustafa/progress

The words that I have already learned are in a bold-blue color and the rest in a normal font.

I want a method (preferably in Python or Java) to extract the words that I have already learned. I tried using Python Requests to fetch the source of the page and work from there, but the source does not seem to contain any information that could be used to tell the learned words apart from the rest.

Any help would be appreciated!

Image: Snapshot of the page

2 Answers

1

As you rightly mentioned, this is web scraping, and Python has excellent modules for it. The most obvious one is BeautifulSoup.

So, to get the information from your webpage:

  • You first need to understand the structure of the webpage.
  • In some cases scraping might not be fully legal.
  • The bigger challenge is whether the webpage can be scraped at all.
    • You can figure this out by looking at the source of the webpage.
    • If the text you want to grab is visible in the source, or in one of the pages its hrefs link to, then it should be possible to scrape it using BeautifulSoup.
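A minimal sketch of that check, assuming you already know one word that shows up on the rendered page. The helper name `appears_in_source` and both HTML snippets are made up for illustration; in practice the string would come from something like `requests.get(url).text`.

```python
def appears_in_source(html: str, needle: str) -> bool:
    """Return True if `needle` occurs anywhere in the raw HTML text."""
    return needle in html


# Hypothetical snippets standing in for downloaded pages.
static_page = "<ul id='words'><li><b><a>باب</a></b></li></ul>"
js_page = "<div id='app'></div><script src='bundle.js'></script>"

print(appears_in_source(static_page, "باب"))  # word is in the markup
print(appears_in_source(js_page, "باب"))      # word is filled in later by a script
```

If the word is missing from the raw source, the data is probably loaded from another URL or rendered by JavaScript, and that other URL is what you need to request.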

Solution:

  • Before you arrive at a solution, you must understand the HTML structure and the ways in which you can identify an element on a webpage.
  • There are many ways, such as:

    • using the "id" of the element
    • using the class or tag name directly
    • using the XPath of the element
    • or a combination of any or all of the above
  • Once you reach this point, it should be clear how to proceed:

import requests
from bs4 import BeautifulSoup

# make a request to the webpage and grab the HTML response
page = requests.get("your url here").content

# pass it on to BeautifulSoup
soup = BeautifulSoup(page, 'html.parser')

# depending on how you want to search, you can use find_all(), find(), select() and other methods
soup.find_all('your tag')
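To make the lookup options listed above concrete, here is a small sketch run against an inline HTML string. The markup, ids and class names are invented for illustration and have nothing to do with the duome.eu page. Note that BeautifulSoup does not evaluate XPath expressions itself, so the "combination" case is shown with a CSS selector instead.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for illustration only.
html = """
<div id="main">
  <p class="note">first</p>
  <p class="note">second</p>
  <span>third</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

by_id = soup.find(id="main")              # one element, looked up by its id
by_class = soup.find_all(class_="note")   # every element carrying the class
by_tag = soup.find_all("span")            # every element with this tag name
combined = soup.select("#main > p.note")  # CSS selector combining id, tag and class

print(by_id.name)
print([p.get_text() for p in by_class])
print([s.get_text() for s in by_tag])
print(len(combined))
```

`find` returns a single element (or `None`), while `find_all` and `select` return lists, so check for an empty result before indexing.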

0

This script should print all bold words from your page:

import re
import requests
from bs4 import BeautifulSoup

cookie_url = 'https://duome.eu/tz.php?time=GMT%202'
vocabulary_url = 'https://duome.eu/vocabulary/en/ar/{user_id}'
url = 'https://duome.eu/theahmedmustafa/progress'

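with requests.Session() as s:
    s.get(cookie_url)  # load cookies
    html_data = s.get(url).text
    # the progress page links to the vocabulary page, whose URL contains the numeric user id
    user_id = re.search(r'/vocabulary/en/ar/(\d+)', html_data).group(1)
    soup = BeautifulSoup(s.get(vocabulary_url.format(user_id=user_id)).text, 'html.parser')
    # learned words are the bold ones: an <a> inside a <b> inside a list item under #words
    for a in soup.select('#words li > b > a'):
        print(a.text)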

This prints:

أَرْوى
أَلْمانْيا
أَمريكا
أَمريكِيّ
أَمْريكِيّة
أَمْسْتِرْدام
أَنا
أَنْتَ
أَنْتِ
أَهْلاً
أَيْن
أُرْدُنِيّ
أُرْدُنِيّة
أُسْتاذ
أُسْتُرالْيا
إِسْكُتْلَنْدا
إِسْكُتْلَنْدِيّ
إِسْلامِيّة
إِنْجِليزِيّ
إِنْجِلْتِرا
امْرَأة
اِمْرَأة
باب
باريس

... and so on.
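The two parsing steps in the script can also be tried offline. The sketch below is only an illustration: both HTML fragments are hypothetical, with a structure guessed from the regex and the selector in the script, and the user id `123456` and the helper names `extract_user_id` and `learned_words` are made up.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical fragments; the real pages on duome.eu may look different.
progress_html = '<a href="/vocabulary/en/ar/123456">Words</a>'
vocabulary_html = """
<div id="words">
  <ul>
    <li><b><a href="#">باب</a></b></li>
    <li><a href="#">بيت</a></li>
    <li><b><a href="#">أَنا</a></b></li>
  </ul>
</div>
"""


def extract_user_id(html):
    """Return the numeric id from a vocabulary link, or None if there is none."""
    match = re.search(r"/vocabulary/en/ar/(\d+)", html)
    return match.group(1) if match else None


def learned_words(html):
    """Return the text of every link wrapped in <b> inside a list item under #words."""
    soup = BeautifulSoup(html, "html.parser")
    return [a.get_text() for a in soup.select("#words li > b > a")]


print(extract_user_id(progress_html))
print(learned_words(vocabulary_html))
```

Returning `None` when the link is missing avoids the `AttributeError` that calling `.group(1)` on a failed match would raise, for example if the page layout changes.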
