I want to read a variable declared inside a JavaScript block in the HTML, but the script tag has no id or other element to target. How can I get this data?

Because there is no element address, only a variable name, I don't know how to locate it.

Website HTML:

<script type="text/javascript">
var imgInfoData = 'data which i want to crawl'

</script>

My Python code:

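import time

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# set url
HOMEPAGE = "https://land.naver.com/info/complexGallery.nhn?newComplex=Y&startImage=Y&rletNo=102235"

# open web
driver = webdriver.Firefox()
driver.wait = WebDriverWait(driver, 2)
driver.get(HOMEPAGE)

# try to get text from html
time.sleep(1)
WebDriverWait(driver, 3).until(EC.presence_of_element_located((By.XPATH, '//script["var"]'))).text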

1 Answer

I checked the site you are scraping, and the script is already included in the HTML page, so I don't think you need webdriver; you can just use requests and BeautifulSoup.

Get the HTML using requests:

res = requests.get(url, headers=headers, params=params)

Then parse the HTML text, collect the script tags, and find the one that contains var imgInfoData:

# excerpt from the body of getimgInfoData(), shown in full below
soup = BeautifulSoup(res.text, "html5lib")
scripts = soup.findAll('script', attrs={'type':'text/javascript'})
for script in scripts:
    if "var imgInfoData" in script.text:  # script with imgInfoData captured
        return script.text.replace("var imgInfoData =","").strip()[:-1]

Just remove the leading

var imgInfoData =

and the trailing

;

from the text to get the string value. Alternatively, you could use a regex to extract the JSON string from the script text.
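The regex alternative could look roughly like the sketch below. The helper name extract_js_var is hypothetical, and the pattern assumes the whole declaration sits on a single line, with or without a trailing semicolon; a value that spans several lines would need a different approach.

```python
import re


def extract_js_var(script_text, name="imgInfoData"):
    """Return the raw right-hand side of `var <name> = ...`, or None.

    Hypothetical helper: assumes the declaration fits on one line.
    """
    pattern = re.compile(
        r"var\s+" + re.escape(name) + r"\s*=\s*(.+?)\s*;?\s*$",
        re.MULTILINE,
    )
    match = pattern.search(script_text)
    return match.group(1) if match else None


if __name__ == "__main__":
    sample = "\nvar imgInfoData = 'data which i want to crawl'\n\n"
    print(extract_js_var(sample))
```

Unlike slicing off the last character, this only drops a semicolon when one is actually present, so it also copes with a declaration like the one in the question, which has no trailing semicolon.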

Full Code:

import requests
from bs4 import BeautifulSoup

def getimgInfoData():
    url = "https://land.naver.com/info/complexGallery.nhn"
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
    params = {"newComplex":"Y",
              "startImage":"Y",
              "rletNo":"102235"}
    res = requests.get(url, headers=headers, params=params)

    soup = BeautifulSoup(res.text, "html5lib")
    scripts = soup.findAll('script', attrs={'type':'text/javascript'})
    for script in scripts:
        if "var imgInfoData" in script.text: #script with imgInfoData captured
            return script.text.replace("var imgInfoData =","").strip()[:-1]
    return None

print(getimgInfoData())

Then convert the result of getimgInfoData() to JSON if you want.
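As a rough sketch of that conversion, json.loads from the standard library can decode the extracted text, provided it really is valid JSON. The helper name parse_js_value is hypothetical, and falling back to the raw string is an assumption about what you would want when the value is not JSON (for example a single-quoted JavaScript string).

```python
import json


def parse_js_value(raw):
    """Decode the extracted text as JSON; return it unchanged if that fails."""
    if raw is None:
        return None
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # Not valid JSON, e.g. a single-quoted JavaScript string literal.
        return raw


if __name__ == "__main__":
    print(parse_js_value('{"images": ["a.jpg", "b.jpg"]}'))
```

In practice you would call it as parse_js_value(getimgInfoData()).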
