Python extract json structure from html page

Question

In Python I'm reading the content of an HTML page that contains a lot of data. I read the web page into a string like this:

import urllib.request as req

url = 'https://myurl.com/'
reqq = req.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
reddit_file = req.urlopen(reqq)
reddit_data = reddit_file.read().decode('utf-8')

If I print reddit_data I can see the whole HTML content correctly. Inside it there is a JSON-like structure from which I would like to extract some fields.

The structure is shown below:

"dealDetails" : {
      "f240141a" : {
         "egressUrl" : "https://ccc.com",
         "title" : "ZZZ",
         "type" : "ghi",
      },
      "5f9ab246" : {
         "egressUrl" : "https://www.bbb.com/",
         "title" : "YYY",
         "type" : "def",
      },
      "2bf6723b" : {
         "egressUrl" : "https://www.aaa.com//",
         "title" : "XXX",
         "type" : "abc",
      },
}

What I want to do is find the dealDetails field and then, for each of the keys f240141a, 5f9ab246 and 2bf6723b, get the egressUrl, title and type values.

Thanks

Can you post the full script tag?

– Rakesh, Commented Oct 15, 2019 at 7:41

shaik moeed · Accepted Answer · 2019-10-16 05:10:00Z

3

Try this,

[nested_dict['egressUrl'] for nested_dict in reddit_data['dealDetails'].values()]

To access the values of parsed JSON, treat it as a dictionary and use the usual dictionary syntax.
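As a minimal sketch of that dictionary access, assuming reddit_data has already been parsed into a dictionary (the literal below is a shortened copy of the structure from the question, written by hand for illustration), iterating over .items() yields both the deal id and its nested dictionary:

```python
# Assumption: reddit_data is already a dict, not the raw HTML string.
reddit_data = {
    "dealDetails": {
        "f240141a": {"egressUrl": "https://ccc.com", "title": "ZZZ", "type": "ghi"},
        "5f9ab246": {"egressUrl": "https://www.bbb.com/", "title": "YYY", "type": "def"},
    }
}

deals = []
for deal_id, deal in reddit_data["dealDetails"].items():
    # deal_id is the key (a string); deal is the nested dictionary
    deals.append((deal_id, deal["egressUrl"], deal["title"], deal["type"]))

for row in deals:
    print(row)
```

Iterating over the dictionary itself, or over its keys, would only give the id strings, and indexing a string with 'egressUrl' raises a TypeError.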

Edit-1:

Make sure reddit_data is a dictionary.

If type(reddit_data) is str, you need to convert it first:

import ast
reddit_data = ast.literal_eval(reddit_data)

OR

import json
reddit_data = json.loads(reddit_data)
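Neither call accepts a whole HTML page, so the JSON-like block has to be cut out of the string first. The following is only a sketch of one way to do that: extract_object is a hypothetical helper (not part of any library) that finds the key and returns the brace-balanced block after it, and sample_html is an invented stand-in for the real page. It parses with ast.literal_eval because the trailing commas in the question's sample are not valid JSON, while Python dict literals allow them. If the real data contains true, false or null, ast.literal_eval will reject it.

```python
import ast

def extract_object(text, key):
    """Return the {...} block that follows "key" in text, as a string."""
    start = text.index('"%s"' % key)
    open_pos = text.index("{", start)
    depth = 0
    in_string = False
    escaped = False
    for pos in range(open_pos, len(text)):
        ch = text[pos]
        if in_string:
            # Skip over string contents so braces inside values are ignored
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return text[open_pos:pos + 1]
    raise ValueError("unbalanced braces after %r" % key)

# Hypothetical page content; the real page will look different.
sample_html = '''<html><script>var state = { "dealDetails" : {
      "f240141a" : {
         "egressUrl" : "https://ccc.com",
         "title" : "ZZZ",
         "type" : "ghi",
      },
      "5f9ab246" : {
         "egressUrl" : "https://www.bbb.com/",
         "title" : "YYY",
         "type" : "def",
      },
} };</script></html>'''

deal_details = ast.literal_eval(extract_object(sample_html, "dealDetails"))
rows = [(d["egressUrl"], d["title"], d["type"]) for d in deal_details.values()]
```

With the real page, the call would be extract_object(reddit_data, "dealDetails"), provided the key appears in the page exactly as in the question.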

edited Oct 16, 2019 at 5:10

answered Oct 15, 2019 at 7:36

2 Comments

xXJohnRamboXx Over a year ago

I tried your suggestion but I get this error:

[nested_dict['egressUrl'] for nested_dict in reddit_data['dealDetails'].keys()]
TypeError: string indices must be integers

shaik moeed Over a year ago

@xXJohnRamboXx Parse your JSON data first using json.loads(your json data) or ast.literal_eval(your json data).

marc_s · Accepted Answer · 2019-10-27 21:42:32Z

If you just want to know how to access egressUrl, title and type, the snippet below is all you need. Be careful, though: it won't work unless you have already converted reddit_data from an HTML string into something like a dictionary. (I modified shaik moeed's answer slightly to also return title and type.)

[(i['egressUrl'], i['title'], i['type']) for i in reddit_data['dealDetails'].values()]

However, if I understood correctly, the part you're missing is the conversion from HTML to a JSON-friendly form. What I personally use, even though it's quite unpopular, is the eval function:

dictionary = eval(reddit_data)

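This converts the text into a dictionary, but only if the text is a valid Python expression, so use it only on the part of the page that looks like a dictionary and not on the whole file. (One reason eval is unpopular is that it executes whatever code the string contains, which makes it unsafe on pages you don't control. It also doesn't convert the JSON literals true, false and null to Python's True, False and None; it raises a NameError on them instead. Be careful with that :) )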
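If you would rather avoid eval, a rough alternative is to strip the trailing commas and hand the fragment to json.loads, which understands true, false and null. This is only a sketch under assumptions: fragment stands for the dictionary-like text already cut out of the page, the "active" field is invented to show the boolean handling, and the regular expression is naive (it would also alter a string value that happens to contain a comma followed by a closing brace or bracket).

```python
import json
import re

# Assumption: fragment is the dictionary-like text already cut out of the page.
# The "active" field is hypothetical and only illustrates true/false handling.
fragment = '''{
   "2bf6723b" : {
      "egressUrl" : "https://www.aaa.com//",
      "title" : "XXX",
      "type" : "abc",
      "active" : true,
   },
}'''

# Drop commas that sit directly before a closing brace or bracket
cleaned = re.sub(r",\s*([}\]])", r"\1", fragment)
deals = json.loads(cleaned)

for deal_id, deal in deals.items():
    print(deal_id, deal["egressUrl"], deal["title"], deal["type"], deal["active"])
```

Unlike eval, json.loads never executes code from the page; it either returns data or raises json.JSONDecodeError.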

Hope that helped!
