
I am trying to extract table data from the following page:

http://www.mfinante.gov.ro/patrims.html?adbAdbId=4283

The problem is the page seems to be constantly adding rows dynamically, and using requests returns only the HTML without the table. I also tried to use Selenium and wait until the page loads fully (as the number of rows is finite), but Selenium waits while the page loads until the browser runs out of memory and crashes (at about 100K rows).

My question is, how do I get the content being sent to the page, perhaps in chunks, and save it? Is there a way to simulate the call the browser is making?

Here is what I have managed with Selenium, which works for smaller samples (e.g. adbAdbId=30):

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# generous wait timeout (seconds): the page can take a very long time to load
delay = 800

options = webdriver.ChromeOptions()
options.add_experimental_option('useAutomationExtension', False)
driver = webdriver.Chrome(options=options, executable_path="chromedriver.exe")
driver.set_page_load_timeout(1000)
url = 'http://www.mfinante.gov.ro/patrims.html?adbAdbId=30'
driver.get(url)

try:
    # wait until the table element is present in the DOM
    WebDriverWait(driver, delay).until(EC.presence_of_element_located((By.ID, 'patrims')))
    print("Page is ready!")
except TimeoutException:
    print("Loading took too much time!")


rows = driver.find_elements_by_xpath("//table[@id='patrims']/tbody/tr")
print(len(rows))

listofdicts = []

def builder(outputlist, inputlist):
    """Parse each row's HTML with BeautifulSoup and collect its fields as a dict."""
    for row in inputlist:
        soup = BeautifulSoup(row.get_attribute('innerHTML'), 'html.parser')
        td = soup.find_all('td')

        d = {
            "Legend": soup.find("legend").get_text().strip(),
            "Localitatea": td[2].get_text().strip(),
            "Strada": td[4].get_text().strip(),
            "Descriere Tehnica": td[6].get_text().strip(),
            "Cod de identificare": td[-7].get_text().strip(),
            "Anul dobandirii sau darii in folosinta": td[-6].get_text().strip(),
            "Valoare": td[-5].get_text().strip(),
            "Situatie juridica": td[-4].get_text().strip(),
            "Situatie juridica actuala": td[-3].get_text().strip(),
            "Tip bun": td[-2].get_text().strip(),
            "Stare bun": td[-1].get_text().strip(),
        }

        outputlist.append(d)
    print('done!')



builder(listofdicts, rows)

print('writing result')
frame = pd.DataFrame(listofdicts)
frame.to_csv(r'output30.csv')

  • Have you tried executing scripts? Commented Dec 6, 2019 at 18:25
  • No, not sure how to do that. I also tried using requests.Session, but it got me the same result, or I didn't use it right. Commented Dec 6, 2019 at 18:28
  • This should help you: pythonbasics.org/selenium_execute_javascript Commented Dec 6, 2019 at 18:35
  • Maybe you can try Beautiful Soup instead of Selenium. As far as I know, it creates a snapshot of the site, unlike Selenium: crummy.com/software/BeautifulSoup/bs4/doc Commented Dec 6, 2019 at 20:17
  • The problem is Beautiful Soup depends on requests to get the HTML, and requests doesn't get the full HTML. Commented Dec 6, 2019 at 20:28
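A minimal sketch of the "executing scripts" suggestion from the comments, assuming `driver` is a live Selenium session with the page already loaded: collect the `innerHTML` of every row in one JavaScript call instead of one WebDriver round-trip per row, which avoids holding 100K+ `WebElement` references.

```python
# Hedged sketch of the execute-JavaScript idea: the selector is taken from
# the question's table id ("patrims"); everything else here is an assumption.
js = """
return Array.from(
    document.querySelectorAll('#patrims tbody tr'),
    tr => tr.innerHTML
);
"""
# rows_html = driver.execute_script(js)  # list of innerHTML strings
# Each string can then be fed to BeautifulSoup exactly as in the question's
# builder() function, with no per-row WebDriver traffic.
```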

1 Answer


The page does not update dynamically; it just takes veeery long to load. With

driver.set_page_load_timeout(3600)

and a lot of patience (more than 30 minutes) it works.

A session with requests works too, but the server immediately resets the connection with the default user-agent, so I am not sure whether they want to be crawled automatically. Please check the site's policy and be a good netizen!
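For reference, a minimal streaming sketch with requests. The User-Agent string below is an assumption (any common browser UA may do), since the server appears to reset connections from the default python-requests agent; streaming the body to disk in chunks also avoids holding the huge response in memory.

```python
import requests

session = requests.Session()
# Assumed browser User-Agent -- replace with whatever your own browser sends.
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/78.0.3904.108 Safari/537.36"
})

def download(url, path, chunk_size=1 << 20):
    """Save the response to disk in 1 MiB chunks instead of buffering it all."""
    with session.get(url, stream=True, timeout=3600) as resp:
        resp.raise_for_status()
        with open(path, "wb") as fh:
            for chunk in resp.iter_content(chunk_size=chunk_size):
                fh.write(chunk)

# download("http://www.mfinante.gov.ro/patrims.html?adbAdbId=4283",
#          "patrims_4283.html")  # then parse the saved file offline
```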


1 Comment

That is actually a useful point. I will contact them. Thank you
