
A site of mine went offline a while ago and I need to recover the images. I've written some Python that extracts the code from a script tag with Beautiful Soup. I now need to parse the URLs from the extracted text; the ones I need are the "large" image URLs. I'm unsure how to loop over all the images rather than just the first, and how to remove the speech marks. Any help would be greatly appreciated.

Extracted Text:

var gallery_items = [{
    "type": "image",
    "medium": "https:\/\/web.archive.org\/web\/20180324152250\/http:\/\/www.example.com\/wp-content\/uploads\/2017\/06\/test_hhf_5755-400x267.jpg",
    "medium-height": 267,
    "medium-width": 400,
    "large": "https:\/\/web.archive.org\/web\/20180324152250\/http:\/\/www.example.com\/wp-content\/uploads\/2017\/06\/test_hhf_5755-675x450.jpg",
    "large-height": 450,
    "large-width": 675,
    "awp-gallery": "https:\/\/web.archive.org\/web\/20180324152250\/http:\/\/www.example.com\/wp-content\/uploads\/2017\/06\/test_hhf_5755.jpg",
    "caption": ""
}, {
    "type": "image",
    "medium": "https:\/\/web.archive.org\/web\/20180324152250\/http:\/\/www.example.com\/wp-content\/uploads\/2017\/06\/test_hhf_5715-400x267.jpg",
    "medium-height": 267,
    "medium-width": 400,
    "large": "https:\/\/web.archive.org\/web\/20180324152250\/http:\/\/www.example.com\/wp-content\/uploads\/2017\/06\/test_hhf_5715-675x450.jpg",
    "large-height": 450,
    "large-width": 675,
    "awp-gallery": "https:\/\/web.archive.org\/web\/20180324152250\/http:\/\/www.example.com\/wp-content\/uploads\/2017\/06\/test_hhf_5715.jpg",
    "caption": ""
}];

Python Script:

from bs4 import BeautifulSoup
import urllib.request as request
import re

folder = r'./gallery'
URL = 'https://web.archive.org/web/20180324152250/http://www.example.com:80/project/test-museum-visitors-center/'
response = request.urlopen(URL)
soup = BeautifulSoup(response, 'html.parser')

scriptCnt = soup.find('div', {'class': 'posts-wrapper'})
script = scriptCnt.find('script').text

try:
    found = re.search('"large":(.+?)"', script).group(1)
except AttributeError:
    found = 'None Found!'


print(found)

Output:

"https:\/\/web.archive.org\/web\/20180324152250\/http:\/\/www.example.com\/wp-content\/uploads\/2017\/06\/test_hhf_5755-675x450.jpg
  • I think using xpath can help you a bit more: stackoverflow.com/a/29890627/438627 Commented Oct 18, 2018 at 23:39
  • Do you want this: found.replace("\\", "")? Commented Oct 19, 2018 at 7:21
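
Following up on the comments above, here is a minimal regex-only sketch, assuming the script text looks like the extract in the question. re.findall returns every match rather than only the first, moving the capture group inside the quotes leaves the speech marks out, and replacing \/ with / undoes the escaping. The script string below is a shortened stand-in for the real page, not the actual content.

```python
import re

# Shortened stand-in for the text pulled out of the <script> tag (assumption).
script = r'''var gallery_items = [{
    "large": "https:\/\/web.archive.org\/web\/20180324152250\/http:\/\/www.example.com\/wp-content\/uploads\/2017\/06\/test_hhf_5755-675x450.jpg",
    "large-height": 450
}, {
    "large": "https:\/\/web.archive.org\/web\/20180324152250\/http:\/\/www.example.com\/wp-content\/uploads\/2017\/06\/test_hhf_5715-675x450.jpg",
    "large-height": 450
}];'''

# The capture group sits between the quotes, so the speech marks are not captured.
# \s* tolerates optional whitespace after the colon.
pattern = re.compile(r'"large":\s*"(.+?)"')

# findall returns the captured group for every match, not only the first one.
large_urls = [m.replace('\\/', '/') for m in pattern.findall(script)]

for url in large_urls:
    print(url)
```

The pattern requires a colon directly after "large", so keys such as "large-height" are not matched.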

1 Answer

The data assigned to gallery_items is valid JSON, so it is easy to parse with Python's json module. All you need to do is carefully cut the JSON out of the script text and pass it to the parser. The code might look something like this:

import json
script_str = '''var gallery_items = [{ "type": "image", "medium": "https:\/\/web.archive.org\/web\/20180324152250\/http:\/\/www.example.com\/wp-content\/uploads\/2017\/06\/test_hhf_5755-400x267.jpg", "medium-height": 267, "medium-width": 400, "large": "https:\/\/web.archive.org\/web\/20180324152250\/http:\/\/www.example.com\/wp-content\/uploads\/2017\/06\/test_hhf_5755-675x450.jpg", "large-height": 450, "large-width": 675, "awp-gallery": "https:\/\/web.archive.org\/web\/20180324152250\/http:\/\/www.example.com\/wp-content\/uploads\/2017\/06\/test_hhf_5755.jpg", "caption": "" }, { "type": "image", "medium": "https:\/\/web.archive.org\/web\/20180324152250\/http:\/\/www.example.com\/wp-content\/uploads\/2017\/06\/test_hhf_5715-400x267.jpg", "medium-height": 267, "medium-width": 400, "large": "https:\/\/web.archive.org\/web\/20180324152250\/http:\/\/www.example.com\/wp-content\/uploads\/2017\/06\/test_hhf_5715-675x450.jpg", "large-height": 450, "large-width": 675, "awp-gallery": "https:\/\/web.archive.org\/web\/20180324152250\/http:\/\/www.example.com\/wp-content\/uploads\/2017\/06\/test_hhf_5715.jpg", "caption": "" }];'''
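# Slice out the array literal: everything after the assignment up to the closing "];"
prefix = 'var gallery_items = '
start = script_str.find(prefix) + len(prefix)
end = script_str.find('];', start) + 1  # + 1 keeps the closing bracket
gallery_items = json.loads(script_str[start:end])
for item in gallery_items:
    # json.loads has already turned "\/" into "/" and dropped the quotes
    print(item['large'])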
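
To plug this into the script from the question, the extraction could be wrapped in a small helper. This is only a sketch under assumptions: extract_gallery_items is a hypothetical name, and the regex assumes the array is assigned to gallery_items and closed by "];" as in the extract. In the original script you would pass the script variable instead of the shortened sample string used here.

```python
import json
import re


def extract_gallery_items(script_text):
    """Return the gallery_items array from the script text as a list of dicts.

    Hypothetical helper: assumes the array literal is valid JSON and that
    the assignment ends with "];", as in the extract in the question.
    """
    match = re.search(
        r'var\s+gallery_items\s*=\s*(\[.*?\])\s*;', script_text, re.DOTALL
    )
    if match is None:
        return []  # no gallery found in this script
    return json.loads(match.group(1))


# Shortened sample standing in for the real script text (assumption).
sample = r'''var gallery_items = [{
    "type": "image",
    "large": "https:\/\/web.archive.org\/web\/20180324152250\/http:\/\/www.example.com\/wp-content\/uploads\/2017\/06\/test_hhf_5755-675x450.jpg",
    "caption": ""
}, {
    "type": "image",
    "large": "https:\/\/web.archive.org\/web\/20180324152250\/http:\/\/www.example.com\/wp-content\/uploads\/2017\/06\/test_hhf_5715-675x450.jpg",
    "caption": ""
}];'''

large_urls = [item['large'] for item in extract_gallery_items(sample)]
for url in large_urls:
    print(url)
```

Returning an empty list when nothing matches means the loop simply does nothing on pages without a gallery, so there is no AttributeError to catch.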

Hope this helps! Cheers!


1 Comment

Thank you for taking the time to answer, this is exactly what I was trying to achieve.
