
I am trying to extract table data from the following page:

http://www.mfinante.gov.ro/patrims.html?adbAdbId=4283

The problem is the page seems to be constantly adding rows dynamically, and using requests returns only the HTML without the table. I also tried to use Selenium and wait until the page loads fully (as the number of rows is finite), but Selenium waits while the page loads until the browser runs out of memory and crashes (at about 100K rows).

My question is, how do I get the content being sent to the page, perhaps in chunks, and save it? Is there a way to simulate the call the browser is making?

Here is what I have managed with Selenium, which works for smaller samples (e.g. adbAdbId=30):

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# generous wait timeout (seconds): the page can take a very long time to load
delay = 800

options = webdriver.ChromeOptions()
options.add_experimental_option('useAutomationExtension', False)
driver = webdriver.Chrome(options=options, executable_path="chromedriver.exe")
driver.set_page_load_timeout(1000)
url = 'http://www.mfinante.gov.ro/patrims.html?adbAdbId=30'
driver.get(url)

try:
    # wait until the table element is present in the DOM
    WebDriverWait(driver, delay).until(EC.presence_of_element_located((By.ID, 'patrims')))
    print("Page is ready!")
except TimeoutException:
    print("Loading took too much time!")


rows = driver.find_elements_by_xpath("//table[@id='patrims']/tbody/tr")
print(len(rows))

listofdicts = []

def builder(outputlist, inputlist):
    """Parse each row's HTML with BeautifulSoup and collect its fields as a dict."""
    for row in inputlist:
        soup = BeautifulSoup(row.get_attribute('innerHTML'), 'html.parser')
        td = soup.find_all('td')

        d = {
            "Legend": soup.find("legend").get_text().strip(),
            "Localitatea": td[2].get_text().strip(),
            "Strada": td[4].get_text().strip(),
            "Descriere Tehnica": td[6].get_text().strip(),
            "Cod de identificare": td[-7].get_text().strip(),
            "Anul dobandirii sau darii in folosinta": td[-6].get_text().strip(),
            "Valoare": td[-5].get_text().strip(),
            "Situatie juridica": td[-4].get_text().strip(),
            "Situatie juridica actuala": td[-3].get_text().strip(),
            "Tip bun": td[-2].get_text().strip(),
            "Stare bun": td[-1].get_text().strip(),
        }

        outputlist.append(d)
    print('done!')



builder(listofdicts, rows)

print('writing result')
frame = pd.DataFrame(listofdicts)
frame.to_csv(r'output30.csv')

  • Have you tried executing scripts? Commented Dec 6, 2019 at 18:25
  • No, not sure how to do that. I also tried using requests.Session, but it got me the same result, or I didn't use it right. Commented Dec 6, 2019 at 18:28
  • This should help you: pythonbasics.org/selenium_execute_javascript Commented Dec 6, 2019 at 18:35
  • Maybe you can try Beautiful Soup instead of Selenium. As far as I know, it creates a snapshot of the site, unlike Selenium: crummy.com/software/BeautifulSoup/bs4/doc Commented Dec 6, 2019 at 20:17
  • The problem is Beautiful Soup depends on requests to get the HTML, and requests doesn't get the full HTML. Commented Dec 6, 2019 at 20:28
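A minimal sketch of the "executing scripts" suggestion from the comments, assuming `driver` is a live Selenium session with the page already loaded: collect the `innerHTML` of every row in one JavaScript call instead of one WebDriver round-trip per row, which avoids holding 100K+ `WebElement` references.

```python
# Hedged sketch of the execute-JavaScript idea: the selector is taken from
# the question's table id ("patrims"); everything else here is an assumption.
js = """
return Array.from(
    document.querySelectorAll('#patrims tbody tr'),
    tr => tr.innerHTML
);
"""
# rows_html = driver.execute_script(js)  # list of innerHTML strings
# Each string can then be fed to BeautifulSoup exactly as in the question's
# builder() function, with no per-row WebDriver traffic.
```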

1 Answer


The page does not update dynamically; it just takes veeery long to load. With

driver.set_page_load_timeout(3600)

and a lot of patience (more than 30 minutes) it works.

A session with requests works too, but the server immediately resets the connection with the default user-agent, so I am not sure whether they want to be crawled automatically. Please check the site's policy and be a good netizen!
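For reference, a minimal streaming sketch with requests. The User-Agent string below is an assumption (any common browser UA may do), since the server appears to reset connections from the default python-requests agent; streaming the body to disk in chunks also avoids holding the huge response in memory.

```python
import requests

session = requests.Session()
# Assumed browser User-Agent -- replace with whatever your own browser sends.
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/78.0.3904.108 Safari/537.36"
})

def download(url, path, chunk_size=1 << 20):
    """Save the response to disk in 1 MiB chunks instead of buffering it all."""
    with session.get(url, stream=True, timeout=3600) as resp:
        resp.raise_for_status()
        with open(path, "wb") as fh:
            for chunk in resp.iter_content(chunk_size=chunk_size):
                fh.write(chunk)

# download("http://www.mfinante.gov.ro/patrims.html?adbAdbId=4283",
#          "patrims_4283.html")  # then parse the saved file offline
```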


1 Comment

That is actually a useful point. I will contact them. Thank you
