A site of mine went offline a while ago and I need to recover the images. I've managed to write some Python that extracts the code from a script tag with Beautiful Soup. I now need to parse some URLs from the extracted text. The URLs I need are the ones stored under the "large" key. I'm unsure how to loop over all the images rather than just the first, and how to remove the quotation marks. Any help would be greatly appreciated.
Extracted Text:
var gallery_items = [{
"type": "image",
"medium": "https:\/\/web.archive.org\/web\/20180324152250\/http:\/\/www.example.com\/wp-content\/uploads\/2017\/06\/test_hhf_5755-400x267.jpg",
"medium-height": 267,
"medium-width": 400,
"large": "https:\/\/web.archive.org\/web\/20180324152250\/http:\/\/www.example.com\/wp-content\/uploads\/2017\/06\/test_hhf_5755-675x450.jpg",
"large-height": 450,
"large-width": 675,
"awp-gallery": "https:\/\/web.archive.org\/web\/20180324152250\/http:\/\/www.example.com\/wp-content\/uploads\/2017\/06\/test_hhf_5755.jpg",
"caption": ""
}, {
"type": "image",
"medium": "https:\/\/web.archive.org\/web\/20180324152250\/http:\/\/www.example.com\/wp-content\/uploads\/2017\/06\/test_hhf_5715-400x267.jpg",
"medium-height": 267,
"medium-width": 400,
"large": "https:\/\/web.archive.org\/web\/20180324152250\/http:\/\/www.example.com\/wp-content\/uploads\/2017\/06\/test_hhf_5715-675x450.jpg",
"large-height": 450,
"large-width": 675,
"awp-gallery": "https:\/\/web.archive.org\/web\/20180324152250\/http:\/\/www.example.com\/wp-content\/uploads\/2017\/06\/test_hhf_5715.jpg",
"caption": ""
}];
Python Script:
from bs4 import BeautifulSoup
import urllib.request as request
import re
folder = r'./gallery'
URL = 'https://web.archive.org/web/20180324152250/http://www.example.com:80/project/test-museum-visitors-center/'
response = request.urlopen(URL)
soup = BeautifulSoup(response, 'html.parser')
scriptCnt = soup.find('div', {'class': 'posts-wrapper'})
script = scriptCnt.find('script').text
try:
    found = re.search('"large":(.+?)"', script).group(1)
except AttributeError:
    found = 'None Found!'
print(found)
Output:
"https:\/\/web.archive.org\/web\/20180324152250\/http:\/\/www.example.com\/wp-content\/uploads\/2017\/06\/test_hhf_5755-675x450.jpg
Would something like found.replace("\\", "") be the right way to strip the backslashes?
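One possible direction, as a rough sketch rather than a definitive fix: the array assigned to gallery_items looks like valid JSON, so it could be captured with a regular expression and handed to json.loads, which decodes the \/ escapes and removes the surrounding quotes in one step. The sketch assumes the array really is valid JSON and ends with "];" as in the extract above. The names extract_large_urls and sample_script are made up for illustration, and sample_script is a shortened stand-in for the text returned by Beautiful Soup.

```python
import json
import re

# Shortened stand-in for the script text extracted with Beautiful Soup.
sample_script = r'''var gallery_items = [{
"type": "image",
"large": "https:\/\/web.archive.org\/web\/20180324152250\/http:\/\/www.example.com\/wp-content\/uploads\/2017\/06\/test_hhf_5755-675x450.jpg",
"large-height": 450,
"caption": ""
}, {
"type": "image",
"large": "https:\/\/web.archive.org\/web\/20180324152250\/http:\/\/www.example.com\/wp-content\/uploads\/2017\/06\/test_hhf_5715-675x450.jpg",
"large-height": 450,
"caption": ""
}];'''


def extract_large_urls(script_text):
    """Return the "large" URL of every gallery item found in script_text."""
    # Capture the array literal between "gallery_items =" and the closing "];".
    # Assumption: no "]" followed by ";" occurs inside the data itself.
    match = re.search(r'gallery_items\s*=\s*(\[.*?\])\s*;', script_text, re.DOTALL)
    if match is None:
        return []
    # json.loads turns the escaped "\/" into "/" and yields plain Python strings.
    items = json.loads(match.group(1))
    return [item["large"] for item in items if "large" in item]


urls = extract_large_urls(sample_script)
for url in urls:
    print(url)
```

In the original script, the call would presumably be extract_large_urls(script). Parsing the data as JSON avoids cleaning up quotes and backslashes by hand, though it would fail if the page ever emitted a JavaScript literal that is not strict JSON.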