How to Extract JSON From HTML Source Code Using Regex

Question

Python Script

import requests
import json
from bs4 import BeautifulSoup
import re

url = 'https://www.dunelm.com/product/caldonia-check-natural-eyelet-curtains-1000187301?defaultSkuId=30729125'

r = requests.get(url)
soup = BeautifulSoup(r.content,'html.parser')

# Save source code to file for testing
with open("sourcecode.html", "w", encoding='utf-8') as file:
    file.write(str(soup))

# Regex pattern to capture JSON data within webpage source code
regex_pattern = r"{\"delivery\"*.*false*}}}"

URL: https://www.dunelm.com/product/caldonia-check-natural-eyelet-curtains-1000187301?defaultSkuId=30729125

I'm trying to pull the JSON data embedded within the source code of the URL listed above using Regex.

I manually copied the source code from the URL above into regex101.com and tested it with the following regex pattern:

{\"delivery\"*.*false*}}}

The regex pattern appears to capture the desired JSON data needed.

Issue

When I view the contents of the soup variable or the saved file, I can see that the HTML source code has been captured.
However, I do not know how to apply the regex so that it captures only the JSON string I need to build my desired Python dictionary.

Any help would be greatly appreciated.

Have you tried this: for content in soup.find_all(re.compile("__your_re_pattern")): print(content)
– gilzero, Commented Aug 30, 2021 at 12:55

Dan Ciborowski - MSFT · Accepted Answer · 2022-05-17 18:26:54Z


Maybe something like this can help you:

import re
import requests

url = 'https://www.dunelm.com/product/caldonia-check-natural-eyelet-curtains-1000187301?defaultSkuId=30729125'

r = requests.get(url)
source_text = r.text
# Find every substring matching the pattern; re.findall returns a list of strings
matches = re.findall('put your regex here', source_text)

To serialize the returned list as a JSON string you can use:

import json
# The list is named matches (not json) so it does not shadow the json module
json_format = json.dumps(matches)
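Note that json.dumps goes from a Python object to a JSON string. To go the other way and end up with a dictionary, json.loads can be applied to a single matched string. The following is only a minimal sketch: the page fragment and the pattern are made-up stand-ins (assumptions for illustration, not the actual Dunelm markup), so the pattern would need adapting to the real source.

```python
import json
import re

# Hypothetical page fragment standing in for r.text; not the real page source.
source_text = (
    '<script>window.__DATA__ = '
    '{"delivery": {"standard": true}, "stock": {"available": false}};'
    '</script>'
)

# Hypothetical pattern: start at '{"delivery"' and stop at the first '}'
# that is immediately followed by ';</script>'.
pattern = r'\{"delivery".*?\}(?=;</script>)'

# re.findall returns a list of matched strings.
matches = re.findall(pattern, source_text, flags=re.DOTALL)

# json.loads parses ONE string, so pick an element of the list first.
data = json.loads(matches[0])

print(type(data).__name__)
print(data["stock"]["available"])
```

Anchoring the end of the match on something that follows the JSON (here the assumed ';</script>') tends to be less fragile than a greedy .* ending in a fixed number of closing braces.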

edited May 17, 2022 at 18:26

Dan Ciborowski - MSFT


answered Aug 30, 2021 at 12:47

BlackMath



1 Comment

MarkWP Over a year ago

Excellent ... The above pulls out the JSON portion from the source code as a list. How do I convert the list to a dictionary? In the past I would use something like data = json.loads(json), but this throws an error.
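The error described in this comment is expected: re.findall returns a list, while json.loads accepts a single str (or bytes/bytearray), so each element has to be parsed on its own. A minimal sketch of that, using made-up sample strings in place of the real matches; parse_matches is a hypothetical helper name, not part of any library:

```python
import json

# Made-up stand-in for the list returned by re.findall.
matches = ['{"delivery": {"standard": true}}', '{"broken": ']

# Passing the whole list to json.loads fails with a TypeError.
try:
    json.loads(matches)
    error_name = None
except TypeError as exc:
    error_name = type(exc).__name__


def parse_matches(candidates):
    """Parse each string individually, skipping any that are not valid JSON."""
    parsed = []
    for text in candidates:
        try:
            parsed.append(json.loads(text))
        except json.JSONDecodeError:
            # The regex captured something that is not complete JSON.
            continue
    return parsed


parsed = parse_matches(matches)

print(error_name)
print(parsed)
```

If the pattern yields exactly one match, json.loads(matches[0]) is enough; the loop is only useful when several candidates come back or some of them may be truncated.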
