I need to download the source code of a website like www.humkinar.pk as plain HTML. The content on the site is dynamically generated. I have tried the driver.page_source property of Selenium, but it does not download the page completely; for example, image and JavaScript files are left out. How can I download the complete page? Is there a better and easier solution available in Python?
-
What webdriver do you use? What browser? – xtonousou, Aug 21, 2017 at 12:49
-
I am using Chrome. – Hafiz Muhammad Shafiq, Aug 22, 2017 at 5:39
-
Did you find a good way @HafizMuhammadShafiq? I have the same problem now. – Sam, Mar 16, 2023 at 8:44
3 Answers
Using Selenium
I know your question is about Selenium, but from my experience I can tell you that Selenium is recommended for testing, NOT for scraping. It is very SLOW. Even with multiple instances of headless browsers (Chrome in your case), the results are delayed too much.
Recommendation
dryscrape + BeautifulSoup + lxml (Python 2, 3)
This trio will help you a lot and save you a bunch of time.
Do not use the parser built into dryscrape; it is very SLOW and buggy. For this situation you can use BeautifulSoup with the lxml parser instead. Use dryscrape to scrape JavaScript-generated content, plain HTML, and images. If you are scraping a lot of links simultaneously, I highly recommend using something like ThreadPoolExecutor.
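A minimal sketch of the ThreadPoolExecutor approach mentioned above. Note that `fetch` here is just a placeholder standing in for a real dryscrape/requests call, so the example stays self-contained; in practice you would put your session-visit logic inside it.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch(url):
    # Placeholder for a real scraping call (e.g. a dryscrape session visit);
    # here we just echo the URL so the sketch runs without network access.
    return f"<html>{url}</html>"

urls = [f"https://example.com/page/{i}" for i in range(10)]

# Scrape many links concurrently; max_workers bounds the parallelism.
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = {pool.submit(fetch, u): u for u in urls}
    # Collect results keyed by URL as each future completes.
    results = {futures[f]: f.result() for f in as_completed(futures)}

print(len(results))
```

Because the pool completes futures in an arbitrary order, keying the results by URL (rather than appending to a list) keeps them easy to match back to their source pages.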
Edit #1
dryscrape + BeautifulSoup usage (Python 3+)
from dryscrape import start_xvfb
from dryscrape.session import Session
from dryscrape.mixins import WaitTimeoutError
from bs4 import BeautifulSoup


def new_session():
    session = Session()
    session.set_attribute('auto_load_images', False)
    session.set_header('User-Agent', 'SomeUserAgent')
    return session


def session_reset(session):
    return session.reset()


def session_visit(session, url, check):
    session.visit(url)
    # ensure that the target element is visible first
    if check:
        try:
            session.wait_for(lambda: session.at_css(
                'SOME#CSS.SELECTOR.HERE'))
        except WaitTimeoutError:
            pass
    body = session.body()
    session_reset(session)
    return body


# start xvfb in case no X server is running (e.g. on a headless server)
start_xvfb()

SESSION = new_session()
URL = 'https://stackoverflow.com/questions/45796411/download-entire-webpage-html-image-js-by-selenium-python/45824047#45824047'
CHECK = False
BODY = session_visit(SESSION, URL, CHECK)

soup = BeautifulSoup(BODY, 'lxml')
RESULT = soup.find('div', {'id': 'answer-45824047'})
print(RESULT)
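The body returned above still references images, scripts, and stylesheets by URL, which is exactly what the question says is missing. A stdlib-only sketch of collecting those asset URLs from the rendered HTML (the `AssetCollector` class and the sample HTML are my own illustration, not part of dryscrape); each collected URL could then be fetched and saved alongside the page:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class AssetCollector(HTMLParser):
    """Collect absolute URLs of images, scripts, and stylesheets."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.assets = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("img", "script") and attrs.get("src"):
            self.assets.append(urljoin(self.base_url, attrs["src"]))
        elif tag == "link" and attrs.get("href"):
            self.assets.append(urljoin(self.base_url, attrs["href"]))

# sample rendered HTML standing in for the body returned by dryscrape
html = """<html><head><link rel="stylesheet" href="/css/site.css">
<script src="app.js"></script></head>
<body><img src="images/logo.png"></body></html>"""

collector = AssetCollector("http://www.humkinar.pk/")
collector.feed(html)
print(collector.assets)
# each URL could then be downloaded, e.g. with urllib.request.urlretrieve
```

urljoin resolves both root-relative (`/css/site.css`) and page-relative (`app.js`) paths against the base URL, so the collected list is ready to download directly.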
Comments
It is not allowed to download a website without permission. If you knew that, you would also know there is hidden code on the hosting server that you, as a visitor, have no access to.