How can I extract URLs from within Javascript code? - Python

Question

A site of mine went offline a while ago and I need to recover the images. I've managed to write some Python that extracts the code from a script tag with Beautiful Soup. I now need to parse some URLs out of the extracted text; the ones I need are the "large" image URLs. I'm unsure how to loop over all the images rather than just the first, and how to remove the quotation marks from the result. Any help would be greatly appreciated.

Extracted Text:

var gallery_items = [{
    "type": "image",
    "medium": "https:\/\/web.archive.org\/web\/20180324152250\/http:\/\/www.example.com\/wp-content\/uploads\/2017\/06\/test_hhf_5755-400x267.jpg",
    "medium-height": 267,
    "medium-width": 400,
    "large": "https:\/\/web.archive.org\/web\/20180324152250\/http:\/\/www.example.com\/wp-content\/uploads\/2017\/06\/test_hhf_5755-675x450.jpg",
    "large-height": 450,
    "large-width": 675,
    "awp-gallery": "https:\/\/web.archive.org\/web\/20180324152250\/http:\/\/www.example.com\/wp-content\/uploads\/2017\/06\/test_hhf_5755.jpg",
    "caption": ""
}, {
    "type": "image",
    "medium": "https:\/\/web.archive.org\/web\/20180324152250\/http:\/\/www.example.com\/wp-content\/uploads\/2017\/06\/test_hhf_5715-400x267.jpg",
    "medium-height": 267,
    "medium-width": 400,
    "large": "https:\/\/web.archive.org\/web\/20180324152250\/http:\/\/www.example.com\/wp-content\/uploads\/2017\/06\/test_hhf_5715-675x450.jpg",
    "large-height": 450,
    "large-width": 675,
    "awp-gallery": "https:\/\/web.archive.org\/web\/20180324152250\/http:\/\/www.example.com\/wp-content\/uploads\/2017\/06\/test_hhf_5715.jpg",
    "caption": ""
}];

Python Script

from bs4 import BeautifulSoup
import urllib.request as request
import re

folder = r'./gallery'
URL = 'https://web.archive.org/web/20180324152250/http://www.example.com:80/project/test-museum-visitors-center/'
response = request.urlopen(URL)
soup = BeautifulSoup(response, 'html.parser')

scriptCnt = soup.find('div', {'class': 'posts-wrapper'})
script = scriptCnt.find('script').text

try:
    found = re.search('"large":(.+?)"', script).group(1)
except AttributeError:
    found = 'None Found!'


print(found)

Output

"https:\/\/web.archive.org\/web\/20180324152250\/http:\/\/www.example.com\/wp-content\/uploads\/2017\/06\/test_hhf_5755-675x450.jpg

I think using xpath can help you a bit more: stackoverflow.com/a/29890627/438627
– qasimzee, Commented Oct 18, 2018 at 23:39

SanthoshSolomon · Accepted Answer · 2018-10-19 07:23:53Z

The array assigned to gallery_items is valid JSON, so it is easy to parse with Python's json library. All you need to do is carefully extract the JSON part on its own and pass it to the JSON parser. The code might look something like this:

import json
script_str = r'''var gallery_items = [{ "type": "image", "medium": "https:\/\/web.archive.org\/web\/20180324152250\/http:\/\/www.example.com\/wp-content\/uploads\/2017\/06\/test_hhf_5755-400x267.jpg", "medium-height": 267, "medium-width": 400, "large": "https:\/\/web.archive.org\/web\/20180324152250\/http:\/\/www.example.com\/wp-content\/uploads\/2017\/06\/test_hhf_5755-675x450.jpg", "large-height": 450, "large-width": 675, "awp-gallery": "https:\/\/web.archive.org\/web\/20180324152250\/http:\/\/www.example.com\/wp-content\/uploads\/2017\/06\/test_hhf_5755.jpg", "caption": "" }, { "type": "image", "medium": "https:\/\/web.archive.org\/web\/20180324152250\/http:\/\/www.example.com\/wp-content\/uploads\/2017\/06\/test_hhf_5715-400x267.jpg", "medium-height": 267, "medium-width": 400, "large": "https:\/\/web.archive.org\/web\/20180324152250\/http:\/\/www.example.com\/wp-content\/uploads\/2017\/06\/test_hhf_5715-675x450.jpg", "large-height": 450, "large-width": 675, "awp-gallery": "https:\/\/web.archive.org\/web\/20180324152250\/http:\/\/www.example.com\/wp-content\/uploads\/2017\/06\/test_hhf_5715.jpg", "caption": "" }];'''
# Keep only the array literal: everything from the first '[' to the last ']'.
start = script_str.index('[')
end = script_str.rindex(']') + 1
gallery_items = json.loads(script_str[start:end])  # json.loads also turns "\/" into "/"
for item in gallery_items:
    print(item['large'])
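
When the text comes straight from the page rather than from a pasted string, the script may contain more than just the array. A slightly more defensive sketch follows. It assumes the script text still contains `var gallery_items = ` followed by a JSON-compatible array, and the helper name `extract_large_urls` is made up for illustration. It relies on `json.JSONDecoder.raw_decode`, which parses a single JSON value from the start of a string and reports where that value ends, so whatever follows the array is ignored.

```python
import json


def extract_large_urls(script_text, marker='var gallery_items = '):
    """Return the "large" URL of every gallery item in script_text.

    Hypothetical helper: assumes the marker is followed by a
    JSON-compatible array, as in the extracted text from the question.
    """
    start = script_text.find(marker)
    if start == -1:
        return []  # the marker is not present in this script
    remainder = script_text[start + len(marker):].lstrip()
    # raw_decode parses one JSON value from the start of the string and
    # returns it together with the index where the value ended.
    items, _end = json.JSONDecoder().raw_decode(remainder)
    return [item['large'] for item in items if 'large' in item]


# Shortened sample in the same shape as the extracted text.
# The final "var unrelated" line is invented to show that trailing
# code after the array does not disturb the parse.
sample = r'''var gallery_items = [{
    "type": "image",
    "large": "https:\/\/web.archive.org\/web\/20180324152250\/http:\/\/www.example.com\/wp-content\/uploads\/2017\/06\/test_hhf_5755-675x450.jpg",
    "caption": ""
}, {
    "type": "image",
    "large": "https:\/\/web.archive.org\/web\/20180324152250\/http:\/\/www.example.com\/wp-content\/uploads\/2017\/06\/test_hhf_5715-675x450.jpg",
    "caption": ""
}];
var unrelated = [1, 2];'''

for url in extract_large_urls(sample):
    print(url)
```

In the script from the question, the call would presumably be `extract_large_urls(script)`, with `script` being the text pulled out of the script tag.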

Hope this helps! Cheers!
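
If you would rather stay with the regular-expression approach from the question, two changes seem to be needed. `re.search` only ever returns the first match, so `re.findall` is used to collect all of them, and the opening quotation mark is moved outside the capture group so it is no longer part of the result. The following is a rough sketch with a shortened stand-in for the `script` variable. Note that the `\/` escapes have to be undone by hand here, which the JSON parser does automatically.

```python
import re

# Shortened stand-in for the `script` text extracted with Beautiful Soup.
script = r'''var gallery_items = [{
    "large": "https:\/\/web.archive.org\/web\/20180324152250\/http:\/\/www.example.com\/wp-content\/uploads\/2017\/06\/test_hhf_5755-675x450.jpg",
    "large-height": 450
}, {
    "large": "https:\/\/web.archive.org\/web\/20180324152250\/http:\/\/www.example.com\/wp-content\/uploads\/2017\/06\/test_hhf_5715-675x450.jpg",
    "large-height": 450
}];'''

# Match "large": then optional whitespace, then the quoted value.
# Both quotes sit outside the group, so they are not captured.
matches = re.findall(r'"large":\s*"(.+?)"', script)

# Undo the JSON escaping of forward slashes ("\/" becomes "/").
urls = [m.replace('\\/', '/') for m in matches]

for url in urls:
    print(url)
```

The pattern does not match keys such as "large-height", because it requires the closing quote and colon directly after the word large.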

Thank you for taking the time to answer, this is exactly what I was trying to achieve.
