2

All - I am using BS4 currently to parse a webpage. BS4 returns this block which is coded in JS as a string and cannot recognize the urls I am trying to extract.

I believe the part I am attempting to extract in BS4 is here:

              var vd1="\x3c\x73\x6f\x75\x72\x63\x65\x20\x73\x72\x63\x3d\x27";
              var vd2="\x27\x20\x74\x79\x70\x65\x3d\x27\x76\x69\x64\x65\x6f\x2f\x6d\x70\x34\x27\x3e";

              var luu=pkl("uggc://navzrurnira.rh/tvs.cuc?vcqrgrpgrq");


                    var soienfu="\x61\x78\x66\x52\x33\x64\x54\x46\x33\x72\x6b\x36\x36\x2f\x67\x4f\x65\x46\x4d\x6b\x49\x2f\x67\x6a\x77\x42\x67\x45\x66\x61\x62\x78\x33\x58\x66\x6c\x43\x44\x78\x70\x6c\x2b\x7a\x76\x38\x6b\x6c\x6a\x6b\x53\x41\x63\x4f\x34\x4a\x77\x47\x47\x78\x35\x2f\x7c\x71\x6e\x47\x49\x55\x4d\x6a\x34\x54\x48\x69\x34\x41\x45\x75\x4c\x78\x35\x58\x46\x4a\x37\x62\x65\x42\x52\x61\x37\x36\x53\x34\x6e\x46\x6f\x75\x49\x47\x55\x42\x7a\x57\x4e\x44\x61\x4a\x6a\x45\x55\x59\x56\x47\x54\x30\x3d"; soienfu=soienfu.replace(/\|/g,"1"); soienfu=vkl(soienfu); soienfu=dfgsr(soienfu);


                    var iusfdb="\x61\x78\x66\x52\x33\x64\x54\x46\x33\x72\x6b\x35\x36\x2b\x4d\x4f\x64\x46\x6b\x6b\x49\x2f\x67\x6a\x77\x42\x67\x45\x66\x61\x62\x78\x33\x58\x66\x6c\x43\x44\x78\x70\x6c\x2b\x7a\x76\x38\x6b\x6c\x6a\x6b\x53\x41\x63\x4f\x34\x4a\x77\x47\x47\x78\x35\x2f\x7c\x71\x6e\x47\x49\x55\x4d\x6a\x34\x54\x48\x69\x34\x41\x45\x75\x4c\x78\x35\x58\x46\x4a\x37\x62\x65\x42\x52\x61\x37\x36\x53\x34\x6e\x46\x6f\x75\x49\x47\x55\x42\x7a\x57\x4e\x44\x61\x4a\x6a\x45\x55\x59\x56\x47\x54\x30\x3d"; iusfdb=iusfdb.replace(/\|/g,"1"); iusfdb=vkl(iusfdb); iusfdb=dfgsr(iusfdb);


                    var ufbjhse="\x61\x78\x66\x52\x33\x64\x54\x46\x33\x72\x6b\x34\x36\x66\x45\x65\x50\x7c\x6c\x6b\x4b\x2f\x73\x76\x78\x52\x67\x4e\x62\x71\x4c\x70\x6c\x6e\x79\x2b\x51\x6e\x35\x30\x6b\x4c\x57\x7a\x74\x6b\x67\x2f\x6b\x48\x77\x32\x64\x61\x7c\x6a\x43\x45\x46\x55\x77\x57\x65\x71\x58\x4d\x51\x51\x6c\x59\x44\x64\x6b\x35\x77\x64\x76\x62\x46\x38\x56\x46\x70\x37\x59\x4f\x6b\x36\x44\x49\x54\x2f\x34\x58\x4d\x2b\x37\x70\x2b\x66\x57\x57\x7a\x43\x54\x75\x6f\x68\x45\x55\x52\x58"; ufbjhse=ufbjhse.replace(/\|/g,"1"); ufbjhse=vkl(ufbjhse); ufbjhse=dfgsr(ufbjhse);


              document.write("<video "+" class='vid'  id='videodiv' width='100%' autoplay='autoplay' preload='none'>"+ vd1 +soienfu+ vd2 + vd1+iusfdb+ vd2 + vd1+ufbjhse+ vd2 +"Your browser does not support the video tag.</video> ");
              }

However, if I see this in the HTML on the website, I get this:

Your browser does not support the video tag.

Ideally, I'd like to pull the video address out of the html block.

http://s4tyh.animeheaven.eu/720kl/msl/Fairy_Tail--55--1449108237__2b0af6.mp4?ww5w75

The code I'm using to get there looks like this.

import requests,bs4,re,sys,os
url="http://animeheaven.eu/watch.php?a=Fairy%20Tail&e=55"
mainsite="http://animeheaven.eu/"
r2=requests.get(url)
r2.raise_for_status()
soup2=bs4.BeautifulSoup(r2.text,"html.parser")
dlink=soup2.select("script")

Now in theory what I'd want to do here is parse dlink for the url, however the JS seems to be causing issues. I am not very familiar with JS and new to web scraping so this is where I get caught up.

# would extract standard url
mylink=re.compile(r"href='(.*)'")
downlink=mylink.search(str(dlink[3]))[1]

1 Answer 1

1

This webpage is a javascript rendered webpage and the content you have in the script there is called 'minified content' which is unreadable to humans (and beautiful soup).

Selenium is a way of executing the javascript used to render the site and then we can process the content there.

I'll walk through the steps of getting and using selenium:

1.Get selenium with pip install selenium

2.Install a driver ( I used the chrome driver )

3.Take a look at the video element with 'inspect element' (right click and you'll see it) with your browser of choice and look for something to identify the video src with, in this case the video has an id with value videodiv. If we inspect this html it looks like:

<video id="videodiv" width="100%" height="100%" style="display: block; cursor: none;" autoplay="autoplay" preload="none">
  <source src="http://s5vkxea.animeheaven.eu/720kl/msl/Fairy_Tail--55--1449108237__2b0af6.mp4?ww5w130" type="video/mp4">
  <source src="http://s4tyh.animeheaven.eu/720kl/msl/Fairy_Tail--55--1449108237__2b0af6.mp4?ww5w130" type="video/mp4">
  <source src="http://s3sd.animeheaven.eu/720kl/msl/Fairy_Tail--55--1449108237__2b0af6.mp4?ww5w130" type="video/mp4">Your browser does not support the video tag.</video>

enter image description here

4.Using the id and tag we just found above we can now write some python to retrieve it:

from selenium import webdriver
browser = webdriver.Chrome(executable_path="C:\Users\yourname\Desktop\chromedriver.exe")
url="http://animeheaven.eu/watch.php?a=Fairy%20Tail&e=55"
browser.get(url)
viddiv = browser.find_element_by_id('videodiv')
source = viddiv.find_element_by_tag_name('source')
source.get_attribute('src')

Output:

'http://s5vkxea.animeheaven.eu/720kl/msl/Fairy_Tail--55--1449108237__2b0af6.mp4?ww5w130'
Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.