HTML from requests not the same as source code

Question

I'm trying to scrape this link: 34th government

(https://knesset.gov.il/govt/eng/GovtByNumber_eng.asp)

which has several tables, but when I perform a request using this code:

import requests
from bs4 import BeautifulSoup

govts_url = r'https://knesset.gov.il/govt/eng/GovtByNumber_eng.asp'
website_url = requests.get(govts_url).text
soup = BeautifulSoup(website_url, 'lxml')
print(f"HTML: \n {soup.prettify()}")

I get the following result:

 <html>
 <head>
  <meta charset="utf-8"/>
  <script>
   window.rbzid="Q5gSRBmIWVopQazRgPTWKOEV0wGh1o+KvPO3KMiDuHxM9vVecPeHn4ult+Ba/KU9zInGRSRXUggEmkFs+D5NKSC/WEkCn+B4PCw9CeWkT+Q=";
        u82222.O=function(x){return x;};u82222.E=function (){return typeof u82222.t.u==='function'?u82222.t.u.apply(u82222.t,arguments):u82222.t.u;};u82222.i=function(x,y){return x+y;};u82222.A=function (){return typeof u82222.u.u==='function'?u82222.u.u.apply(u82222.u,arguments):u82222.u.u;};u82222.Y=function (){return typeof u82222.u.u==='function'?u82222.u.u.apply(u82222.u,arguments):u82222.u.u;};u82222.n=function(x,y){return x+y;};u82222.f=function(x,y){return x+y;};u82222.u=function(){var M=function(K,N){var I=N&0xffff;var r=N-I;return(r*K|0)+(I*K|0)|0;},Y=function(x,d,Z){var n=0xcc9e2d51,b=0x1b873593;var E=Z;var O=d&~0x3;for(var w=0;w<O;w+=4){var e=x.charCodeAt(w)&0xff|(x.charCodeAt(w+1)&0xff)<<8|(x.charCodeAt(w+2)&0xff)<<16|(x.charCodeAt(w+3)&0xff)<<24;e=M(e,n);e=(e&0x1ffff)<<15|e>>>17;e=M(e,b);E^=e;E=(E&0x7ffff)<<13|E>>>19;E=E*5+0xe6546b64|0;}e=0;switch(d%4){case 3:e=(x.charCodeAt(O+2)&0xff)<<16;case 2:e|=(x.charCodeAt(O+1)&0xff)<<8;case 1:e|=x.charCodeAt(O)&0xff;e=M(e,n);e=(e&0x1ffff)<<15|e>>>17;e=M(e,b);E^=e;}E^=d;E^=E>>>16;E=M(E,0x85ebca6b);E^=E>>>13;E=M(E,0xc2b2ae35);E^=E>>>16;return E;};return{u:Y};}();u82222.d=function(x,y){return x+y;};u82222.K=function (){return typeof u82222.u.u==='function'?u82222.u.u.apply(u82222.u,arguments):u82222.u.u;};u82222.N=function (){return typeof u82222.t.u==='function'?u82222.t.u.apply(u82222.t,arguments):u82222.t.u;};u82222.Z=function(x,y){return x+y;};u82222.I=function (){return typeof u82222.u.u==='function'?u82222.u.u.apply(u82222.u,arguments):u82222.u.u;};u82222.e=function (){return typeof u82222.t.u==='function'?u82222.t.u.apply(u82222.t,arguments):u82222.t.u;};u82222.t=function(){return{u:function(K){var A='',I=decodeURI("1?'%1CYH.=uVWU~%254_hW,o,WKM%22(-W%5BW,o,LU%075?'%1CH%5D9.5LU%07#?'%1C@W,o6LU%07%22?'%1Ch%5C$.%25N%07D1?'%1C%5DW,o4LU%07%124=LU%07%3C.%25N%07F=?'%1COW,o7bAH%3E?'%1CL%5B.=uF@F%3E%02%25N%07v%0F13SG%5D?,:AWU~13LU%07%3E?'%1CvY8%205FFD.=uF%5BF%3C-%25N%07J1?'%1C%5CW,o7");for(var 
Y=0,M=0;Y<I.length;Y++,M++){if(M===K.length){M=0;}A+=String.fromCharCode(I.charCodeAt(Y)^K.charCodeAt(M));}A=A.split('~|.');return function(t){return A[t];};}('PA[2))')};}();u82222.o=function(x,y){return x+y;};u82222.r=function (){return typeof u82222.t.u==='function'?u82222.t.u.apply(u82222.t,arguments):u82222.t.u;};u82222.b=function(x,y){return x+y;};u82222.w=function(x){return x;};u82222.s=function(x,y){return x+y;};u82222.F=function(x,y){return x+y;};u82222.M=function (){return typeof u82222.u.u==='function'?u82222.u.u.apply(u82222.u,arguments):u82222.u.u;};u82222.T=function(x,y){return x>y;};function u82222(){}u82222.x=function (){return typeof u82222.t.u==='function'?u82222.t.u.apply(u82222.t,arguments):u82222.t.u;};(typeof window==="object"?window:global).u82222=u82222;_=window;if(u82222.w(u82222.O(_[u82222.r(24)+u82222.e(0)+u82222.E(25)+u82222.E(14)+u82222.e(18)])||_[u82222.N(26)]||_[u82222.d(u82222.F(u82222.n(u82222.e(28),u82222.r(30))+u82222.N(20),u82222.N(14)),u82222.r(18))]||_[u82222.x(23)])||_[u82222.b(u82222.x(16),u82222.x(19))+u82222.r(6)+u82222.x(11)]||_[u82222.Z(u82222.E(6)+u82222.x(10)+u82222.e(9),u82222.x(14))]||_[u82222.s(u82222.T(975.11,476.89)?u82222.N(8):(13,105.77),u82222.E(1))+u82222.E(5)+u82222.N(25)]||_[u82222.E(4)]||_[u82222.o(u82222.x(3)+u82222.N(29)+u82222.e(14),u82222.e(15))+u82222.N(10)+u82222.x(7)]||_[u82222.i(u82222.e(2)+u82222.N(18)+u82222.N(12)+u82222.e(13)+u82222.x(22)+u82222.E(15)+u82222.E(25),u82222.e(27))+u82222.E(21)]){}else{location[u82222.f(u82222.r(11)+u82222.x(6)+u82222.e(17)+u82222.N(0),u82222.e(2))]();}
  </script>
 </head>
 <body>
 </body>
</html>

Which is, of course, not the content I want. I guess I'm missing some kind of "activation" step that makes the site serve the real content. How can I get it?

Thx!

Did you check whether the content you're after is dynamically generated?
– AMC, Commented Feb 18, 2020 at 22:20

Josue Rojas Vega · Accepted Answer · 2020-02-18 21:33:26Z

1

I tried with Selenium (download the driver for the browser you want to use; in my case, ChromeDriver) and it works. You get the full HTML source of the page, and from there you can continue with the web scraping. I hope this helps you :)

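from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

govts_url = r'https://knesset.gov.il/govt/eng/GovtByNumber_eng.asp'
# Path to the downloaded ChromeDriver executable; adjust for your machine.
exe_path = r'C:\Users\JRV\Desktop\WebCrawling/chromedriver.exe'

# Selenium 4 takes the driver path through a Service object;
# on Selenium 3, webdriver.Chrome(exe_path) was the equivalent call.
browser = webdriver.Chrome(service=Service(exe_path))
try:
    browser.get(govts_url)  # a real browser runs the page's JavaScript
    page = browser.page_source  # HTML after the scripts have run
finally:
    browser.quit()  # closes the window and stops the driver process

soup = BeautifulSoup(page, 'html.parser')
print(f"HTML: \n {soup}")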
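
As one possible way to continue from the page source, here is a minimal sketch that pulls every table out of an HTML string as rows of cell text. It uses only the standard library so it runs without extra installs; `TableExtractor` and `extract_tables` are hypothetical names, the sample HTML is made up, and nested tables are not handled. With BeautifulSoup, `soup.find_all('table')` would be the natural starting point instead.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collects every <table> as a list of rows, each row a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.tables = []   # one list of rows per table
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append(" ".join(self._cell))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None
            self._cell = None

    def handle_data(self, data):
        # Keep only non-blank text that sits inside a cell.
        if self._cell is not None and data.strip():
            self._cell.append(data.strip())


def extract_tables(page_source):
    """Return all tables found in page_source (hypothetical helper)."""
    parser = TableExtractor()
    parser.feed(page_source)
    parser.close()
    return parser.tables


# Made-up sample; in the answer above, page_source would be browser.page_source.
sample = (
    "<table><tr><th>No.</th><th>Name</th></tr>"
    "<tr><td>1</td><td>Example</td></tr></table>"
)
for row in extract_tables(sample)[0]:
    print(row)
```

Passing `page` from the Selenium snippet to `extract_tables` should give one list of rows per table on the page, assuming the tables are plain `<table>` markup.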


Hedgy · Accepted Answer · 2020-02-18 18:59:59Z

0

I believe this could be one of those sites where JavaScript activates the page, in which case you would have to use something like Selenium. Check out this post.
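
A rough way to tell whether a response is this kind of JavaScript-only page is to check that it contains a script but no visible text in the body, which is what the output in the question looks like. The sketch below is a heuristic under that assumption, not a general detector; `looks_like_js_shell` and the sample strings are hypothetical.

```python
from html.parser import HTMLParser


class _BodyTextCollector(HTMLParser):
    """Counts <script> tags and collects visible text found inside <body>."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.in_script = False
        self.script_count = 0
        self.body_text = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif tag == "script":
            self.in_script = True
            self.script_count += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif tag == "script":
            self.in_script = False

    def handle_data(self, data):
        # Script source is not visible text, so it is skipped.
        if self.in_body and not self.in_script and data.strip():
            self.body_text.append(data.strip())


def looks_like_js_shell(html):
    """Heuristic: True if the page has a script but an empty body."""
    parser = _BodyTextCollector()
    parser.feed(html)
    parser.close()
    return parser.script_count > 0 and not parser.body_text


# Made-up samples shaped like the two cases discussed above.
shell = "<html><head><script>location.reload();</script></head><body>\n</body></html>"
real = "<html><head></head><body><table><tr><td>34</td></tr></table></body></html>"
print(looks_like_js_shell(shell), looks_like_js_shell(real))
```

If `looks_like_js_shell(requests.get(url).text)` returns `True`, plain `requests` is probably not enough and a browser-driven tool such as Selenium is the likely next step.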
