I have a large number of HTML files that I want to process with BeautifulSoup to generate some statistics. However, I ran into a problem: the HTML files contain scripts that may generate more HTML, and that generated HTML is not being processed. Therefore, I need to render all the JavaScript into static HTML before proceeding.

I have seen some options such as Selenium, but it doesn't seem to fit because I don't want to launch a browser window (the rendering should happen in the background).

Can someone please suggest an appropriate approach to this?

Thanks in advance!

1 Answer

Since you need a JavaScript engine, a headless browser is the way to go. Using Selenium WebDriver with the PhantomJS headless browser is probably your best option:

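from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.PhantomJS()  # headless browser, no window is opened
driver.get("...")  # load the page and let its scripts run
bs = BeautifulSoup(driver.page_source, "html.parser")  # parse the rendered DOM
driver.quit()  # shut down the background browser process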
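
Note that PhantomJS is no longer maintained and Selenium 4 removed `webdriver.PhantomJS`, so on a current setup a headless Firefox or Chrome is a reasonable substitute. Below is a minimal sketch of the same idea applied to a batch of local files, assuming Selenium 4 and an installed Firefox. The helper names `render_files` and `make_headless_firefox` are made up for illustration and are not part of Selenium.

```python
from pathlib import Path


def render_files(driver, paths):
    """Yield (path, rendered_html) for each local HTML file.

    `driver` is expected to behave like a Selenium WebDriver: it needs a
    `get(url)` method and a `page_source` attribute.
    """
    for path in paths:
        # Browsers open local files through file:// URLs, which require
        # an absolute path.
        driver.get(Path(path).resolve().as_uri())
        # page_source reflects the DOM after the page's scripts have run.
        yield path, driver.page_source


def make_headless_firefox():
    """Start Firefox without a visible window (assumes Firefox is installed)."""
    # Imported here so render_files stays usable without Selenium installed.
    from selenium import webdriver

    options = webdriver.FirefoxOptions()
    options.add_argument("-headless")
    return webdriver.Firefox(options=options)
```

A single driver can be reused for every file: create it once with `make_headless_firefox()`, loop over `render_files(driver, paths)`, pass each rendered string to `BeautifulSoup(html, "html.parser")`, and call `driver.quit()` at the end. Scripts that fetch data asynchronously may need an explicit wait before `page_source` is read.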