I want to scrape https://wiki.openstreetmap.org/wiki/Key:office, specifically the table containing all the tags, that is, everything contained within:

<table class="wikitable taginfo-taglist">...</table>

Since everything within:

<div class="taglist" ...> ... </div>

(the parent of the table) is generated by JavaScript, I thought this code could work:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
    
options = Options()
options.add_argument("--headless")
caps = webdriver.DesiredCapabilities().FIREFOX
caps["marionette"] = True
driver = webdriver.Firefox(options=options, capabilities=caps, executable_path='../statics/geckodriver')
    
    
def get_tag_soup(url):
    driver.get(url)
    try:
        table = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CLASS_NAME , "wikitable taginfo-taglist")))
        soup = BeautifulSoup(table.get_attribute('innerHTML'), 'lxml') 
    except Exception as e:
        soup = e
    
    return soup 

get_tag_soup('https://wiki.openstreetmap.org/wiki/Key:office')

But when I run this code I just get a selenium.common.exceptions.TimeoutException('', None, None). More frustratingly, if I instead use WebDriverWait on the parent of the table with EC.presence_of_element_located((By.CLASS_NAME, "taglist")), it sometimes works.

  • if waiting for the parent works, why not do that, then something like table = the_parent.find_element_by_classname('wikitable taginfo-taglist') Commented Feb 5, 2021 at 10:47
  • or just wait longer...site may be slow? Commented Feb 5, 2021 at 10:47
  • Waiting for the parent only works sometimes. Is there a way to wait for the whole site? Commented Feb 5, 2021 at 10:48

1 Answer

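To extract the table containing all the tags, wait for visibility_of_element_located() instead of presence_of_element_located(). Note also that By.CLASS_NAME expects a single class name, so the compound value "wikitable taginfo-taglist" does not match the table; use a CSS selector or an XPath instead. Either of the following locator strategies works: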

  • Using CSS_SELECTOR:

    driver.get("https://wiki.openstreetmap.org/wiki/Key:office")
    print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "table.wikitable.taginfo-taglist"))).text)
    
  • Using XPATH:

    driver.get("https://wiki.openstreetmap.org/wiki/Key:office")
    print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//table[@class='wikitable taginfo-taglist']"))).text)
    
  • Console Output:

    Key Value Element Description Map rendering Image Count
    office accountant An office for an accountant.
    6 895
    1 967
    14
    office advertising_agency A service-based business dedicated to creating, planning, and handling advertising.
    3 916
    580
    3
    office architect An office for an architect or group of architects.
    5 715
    1 239
    12
    office association An office of a non-profit organisation, society, e.g. student, sport, consumer, automobile, bike association, etc.
    13 054
    3 286
    50
    office charity An office of a charitable organization
    696
    384
    7
    office company An office of a private company
    129 801
    36 951
    608
    office consulting An office for a consulting firm, providing expert professional advice to other companies or organisations.
    1 341
    162
    4
    office coworking An office where people can go to work (might require a fee); not limited to a single employer
    1 297
    320
    7
    office diplomatic
    6 634
    4 065
    95
    office educational_institution An office for an educational institution.
    14 172
    8 563
    175
    office employment_agency An office for an employment service.
    7 300
    1 771
    43
    office energy_supplier An office for a energy supplier.
    2 237
    1 112
    19
    office engineer An office for an engineer or group of engineers.
    454
    98
    2
    office estate_agent A place where you can rent or buy a house.
    44 813
    8 042
    39
    office financial An office of a company in the financial sector
    4 891
    1 588
    24
    office forestry A forestry office
    523
    741
    9
    office foundation An office of a foundation
    1 757
    542
    10
    office government An office of a (supra)national, regional or local government agency or department
    98 289
    70 569
    2 300
    office guide An office for tour guides, mountain guides, dive guides, etc.
    587
    168
    1
    office insurance An office at which you can take out insurance policies.
    34 693
    6 475
    91
    office it An office for an IT specialist.
    9 486
    2 039
    51
    office lawyer An office for a lawyer.
    22 881
    4 841
    22
    office logistics An office for a forwarder / hauler.
    2 796
    677
    8
    office moving_company An office which offers a relocation service.
    605
    252
    4
    office newspaper An office of a newspaper
    3 511
    1 450
    27
    office ngo An office for a non-profit, non-governmental organisation (NGO).
    12 693
    3 565
    58
    office notary An office for a notary public (common law)
    3 860
    548
    9
    office political_party An office of a political party
    3 354
    1 017
    8
    office property_management Office of a company, which manages a real estate property.
    796
    162
    2
    office quango An office of a quasi-autonomous non-governmental organisation.
    366
    233
    4
    office religion office of a community of faith
    5 807
    2 172
    43
    office research An office for research and development
    3 667
    4 545
    348
    office surveyor An office of a person doing surveys, this can be risk and damage evaluations of properties and equipment, opinion surveys or statistics.
    451
    109
    1
    office tax_advisor An office for a financial expert specially trained in tax law
    5 053
    823
    4
    office telecommunication An office for a telecommunication company
    16 968
    4 335
    77
    office visa An office of an organisation or business which offers visa assistance
    95
    1
    0
    office water_utility The office for a water utility company or water board.
    743
    908
    20
    office yes Generic tag for unspecified office type.
    27 434
    36 155
    420
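
The .text output above flattens each table row into one line of key, value, and description followed by three separate count lines. A minimal sketch of regrouping that text into one record per row is shown below. The function name parse_taglist_text and the field names are hypothetical, and because the output does not label the three numbers, they are kept as a plain list rather than given names.

```python
import re


def parse_taglist_text(text, key="office"):
    """Group the flattened .text output into one record per table row."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line == key or line.startswith(key + " "):
            # A new row: "<key> <value> <optional description>".
            parts = line.split(None, 2)
            records.append({
                "key": parts[0],
                "value": parts[1] if len(parts) > 1 else "",
                "description": parts[2] if len(parts) > 2 else "",
                "counts": [],
            })
        elif records and re.fullmatch(r"[\d\s]+", line):
            # A count line such as "6 895": drop the thousands separators.
            records[-1]["counts"].append(int(re.sub(r"\s+", "", line)))
        # Anything else (for example the header line) is ignored.
    return records


sample = """Key Value Element Description Map rendering Image Count
office accountant An office for an accountant.
6 895
1 967
14
office diplomatic
6 634
4 065
95
"""
rows = parse_taglist_text(sample)
print(rows[0]["value"], rows[0]["counts"])
# prints: accountant [6895, 1967, 14]
```

Parsing the table's HTML cell by cell is likely more robust than parsing the text; this sketch only shows that the printed output can still be recovered as structured data.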
    

Note: Make sure the browser viewport is maximized, as follows:

options.add_argument("start-maximized")
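
Putting the pieces together, the question's helper could be adapted roughly as follows. This is a sketch under several assumptions, not a tested fix: it assumes Selenium 4 (where the driver path is passed through a Service object rather than executable_path), the names fetch_table_html, TABLE_SELECTOR, and driver_path are hypothetical, and the fixed window size is only a stand-in for a maximized viewport in headless mode.

```python
# CSS selector taken from the answer above.
TABLE_SELECTOR = "table.wikitable.taginfo-taglist"


def fetch_table_html(url, timeout=20, driver_path=None):
    """Return the outerHTML of the tag table once it is visible."""
    # Imported lazily so this module can be loaded without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.firefox.options import Options
    from selenium.webdriver.firefox.service import Service
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("--headless")
    # Assumption: Selenium 4, where the geckodriver path goes through Service.
    service = Service(executable_path=driver_path) if driver_path else Service()
    driver = webdriver.Firefox(service=service, options=options)
    try:
        # Assumption: a fixed size stands in for a maximized viewport.
        driver.set_window_size(1920, 1080)
        driver.get(url)
        table = WebDriverWait(driver, timeout).until(
            EC.visibility_of_element_located((By.CSS_SELECTOR, TABLE_SELECTOR))
        )
        return table.get_attribute("outerHTML")
    finally:
        # Always release the browser, even if the wait times out.
        driver.quit()


# Example call (needs Firefox, geckodriver, and network access):
# html = fetch_table_html("https://wiki.openstreetmap.org/wiki/Key:office")
```

The returned HTML can then be handed to BeautifulSoup(html, 'lxml') as in the question. Letting a TimeoutException propagate, rather than returning the exception in place of the soup, keeps failures visible to the caller.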

6 Comments

Thanks for the answer, but both the XPath and the CSS selector produce the same timeout error for me. Maybe the issue is that the driver isn't rendering the JavaScript?
@Thagor Check out the updated answer and let me know the status.
Sadly it does not solve the issue. I tried waiting for 120 seconds, which doesn't help either, and I tried setting a window size, which does nothing as well.
@Thagor Can you just copy and paste my code and retest, please?
Okay, tried it @DebanjanB and the CSS_SELECTOR works! "wikitable taginfo-taglist" wasn't the right selector.
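
The last comment pins the failure on the locator: By.CLASS_NAME expects one class name, while the table's class attribute holds two. A small sketch of turning such a space-separated class attribute into a CSS selector follows. The helper name class_attr_to_css is hypothetical, and it assumes plain class names that need no CSS escaping.

```python
def class_attr_to_css(class_attr, tag=""):
    """Turn a space-separated class attribute into a CSS selector."""
    classes = class_attr.split()
    if not classes:
        raise ValueError("class attribute is empty")
    # Each class becomes ".name"; chaining them requires all to be present.
    return tag + "".join("." + name for name in classes)


print(class_attr_to_css("wikitable taginfo-taglist", tag="table"))
# prints: table.wikitable.taginfo-taglist
print(class_attr_to_css("taglist"))
# prints: .taglist
```

The first result is the selector used with By.CSS_SELECTOR in the answer.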