Data scraping from a webpage with javascript using python

Question

I'm trying to scrape the title off of a webpage. Initially, I tried using BeautifulSoup but found out that the page itself wouldn't load without JavaScript. So I'm using some code that I found through Google that uses the requests-html library:

from requests_html import HTMLSession
from bs4 import BeautifulSoup
session = HTMLSession()
resp = session.get("https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=3210001601")
resp.html.render()
soup = BeautifulSoup(resp.html.html, "lxml")

soup.find_all('h1')

But there's always an error along the lines of:

D:\Python\TitleSraping\venv\Scripts\python.exe "D:/Python/TitleSraping/venv/Text Scraping.py"
Traceback (most recent call last):
  File "D:\Python\TitleSraping\venv\lib\site-packages\pyppeteer\execution_context.py", line 106, in evaluateHandle
    'userGesture': True,
pyppeteer.errors.NetworkError: Protocol error (Runtime.callFunctionOn): Cannot find context with specified id

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "D:/Python/TitleSraping/venv/Text Scraping.py", line 5, in <module>
    resp.html.render()
  File "D:\Python\TitleSraping\venv\lib\site-packages\requests_html.py", line 598, in render
    content, result, page = self.session.loop.run_until_complete(self._async_render(url=self.url, script=script, sleep=sleep, wait=wait, content=self.html, reload=reload, scrolldown=scrolldown, timeout=timeout, keep_page=keep_page))
  File "D:\Program Files (x86)\Python\lib\asyncio\base_events.py", line 584, in run_until_complete
    return future.result()
  File "D:\Python\TitleSraping\venv\lib\site-packages\requests_html.py", line 531, in _async_render
    content = await page.content()
  File "D:\Python\TitleSraping\venv\lib\site-packages\pyppeteer\page.py", line 780, in content
    return await frame.content()
  File "D:\Python\TitleSraping\venv\lib\site-packages\pyppeteer\frame_manager.py", line 379, in content
    '''.strip())
  File "D:\Python\TitleSraping\venv\lib\site-packages\pyppeteer\frame_manager.py", line 295, in evaluate
    pageFunction, *args, force_expr=force_expr)
  File "D:\Python\TitleSraping\venv\lib\site-packages\pyppeteer\execution_context.py", line 55, in evaluate
    pageFunction, *args, force_expr=force_expr)
  File "D:\Python\TitleSraping\venv\lib\site-packages\pyppeteer\execution_context.py", line 109, in evaluateHandle
    _rewriteError(e)
  File "D:\Python\TitleSraping\venv\lib\site-packages\pyppeteer\execution_context.py", line 238, in _rewriteError
    raise type(error)(msg)
pyppeteer.errors.NetworkError: Execution context was destroyed, most likely because of a navigation.

Process finished with exit code 1

Does anyone know what this means? I'm quite new to this, so I apologize if I'm using any terminology improperly.

NBlack · Accepted Answer · 2019-06-24 23:47:57Z

1

As Ivan said, passing sleep=1 and keep_page=True does the trick. Here is the full code:

from requests_html import HTMLSession
from bs4 import BeautifulSoup

session = HTMLSession()
resp = session.get("https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=3210001601")
resp.html.render(sleep=1, keep_page=True)
soup = BeautifulSoup(resp.html.html, "lxml")
print(soup.find_all('title'))

Response:

[<title>
    Milled wheat and wheat flour produced</title>]

5 Comments

facsasd Over a year ago

Hmm, I wish this was what I was getting, but I still seem to get the same error.

NBlack Over a year ago

Did you try my code? I ran it in my console (Python 3.7) and it works like a charm. Please paste your current code so we can fix it :)

facsasd Over a year ago

So... I did try your code. Sometimes it works and sometimes it doesn't, and I honestly don't know why anymore.

facsasd Over a year ago

I'll try to replicate it

NBlack Over a year ago

I ran it 10 times in a row and it worked every time. Try sleep=2 (2 seconds), or up to 5 seconds if your internet connection is slow. From the documentation: sleep – Integer, if provided, of how long to sleep after initial render.
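
The comments above suggest that the render succeeds only some of the time and that a longer sleep can help. A minimal sketch of that idea follows, assuming it is acceptable to retry the render with progressively longer sleep values. The names render_with_retries and scrape_title, the sleeps schedule, and the pause parameter are illustrative and not part of requests-html; only render(sleep=..., keep_page=...) comes from the answer.

```python
import time


def render_with_retries(render, sleeps=(1, 2, 5), pause=1.0):
    """Call render(sleep=s, keep_page=True) for each s in sleeps until one succeeds.

    Returns the sleep value that worked and re-raises the last error if all fail.
    """
    last_error = None
    for attempt, sleep in enumerate(sleeps):
        if attempt:
            # Short pause between attempts, skipped before the first one.
            time.sleep(pause)
        try:
            render(sleep=sleep, keep_page=True)
            return sleep
        except Exception as error:
            # Broad on purpose for this sketch; the traceback in the question
            # shows pyppeteer.errors.NetworkError.
            last_error = error
    if last_error is None:
        raise ValueError("sleeps must not be empty")
    raise last_error


def scrape_title(url):
    """Hypothetical usage with requests-html and BeautifulSoup (imported lazily)."""
    from bs4 import BeautifulSoup
    from requests_html import HTMLSession

    session = HTMLSession()
    resp = session.get(url)
    render_with_retries(resp.html.render)
    soup = BeautifulSoup(resp.html.html, "lxml")
    return soup.title.get_text(strip=True) if soup.title else None
```

Calling scrape_title with the URL from the question should return the title text if any of the attempts renders cleanly. Whether a longer sleep actually avoids the error on a given machine is not guaranteed.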

Ivan Sveshnikov · Accepted Answer · 2019-06-24 23:39:20Z

0

Seems like a bug in the underlying library, pyppeteer, triggered while processing some JavaScript on the page. Here's one workaround from https://github.com/kennethreitz/requests-html/issues/251; maybe it will help.

resp.html.render(sleep=1, keep_page=True)

3 Comments

facsasd Over a year ago

I tried it out, but I still seem to be getting a similar error.

Ivan Sveshnikov Over a year ago

You might try increasing the sleep parameter. If your page is heavy and your machine is slow, it can help.

Nuclear241 Over a year ago

Note for my future self, or for other people: I tried specifying only keep_page=True, and that was enough to do the trick.

Andrés Aviña · Accepted Answer · 2019-06-24 23:45:55Z

0

You need to load the JavaScript, because without it the page's HTML won't load. You can use Selenium.

2 Comments

facsasd Over a year ago

hmm, I'm trying to follow along to this tutorial theautomatic.net/2019/01/19/… not sure how it works there

Andrés Aviña Over a year ago

The problem is specifically with the page you want to scrape, because it has security against scrapers.

lowtex · Accepted Answer · 2019-06-24 23:45:55Z

0

Try Selenium.

Selenium is a library that allows programs to interact with web pages by taking control of the browser.

Here is an example in an answer to someone else's question.
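
For reference, here is a minimal sketch of the Selenium approach, assuming Selenium and a matching Chrome driver are installed. The function names fetch_title and make_headless_chrome are illustrative. Reading driver.title right after get() assumes the title is available once the page load finishes; a slower page may need an explicit wait.

```python
def fetch_title(url, make_driver):
    """Open url in a browser created by make_driver and return the page title."""
    driver = make_driver()
    try:
        driver.get(url)  # returns once the browser reports the page as loaded
        return driver.title.strip()
    finally:
        # Always close the browser, even if loading the page fails.
        driver.quit()


def make_headless_chrome():
    """Assumed setup: headless Chrome driven by Selenium (imported lazily)."""
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    return webdriver.Chrome(options=options)
```

A call such as fetch_title("https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=3210001601", make_headless_chrome) would then return the title as a plain string. Passing the driver factory as an argument also makes it easy to swap in another browser.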

1 Comment

facsasd Over a year ago

hmm, I'm trying to follow along to this tutorial theautomatic.net/2019/01/19/… not sure how it works there
