With Python, get JSON variable from Javascript

Question

With Python 3.x, I'm trying to get a list of values that are in what looks to be a JSON variable.

Here's some of the HTML:

<script type="text/javascript">

var BandData = {
    id: 171185318,
    name: "MASS",
    fan_email: null,
    account_id: 365569831,
    has_discounts: null,
    image_id: 39000212
};

var EmbedData = {
    tralbum_param: { name: "album", value: 28473799 }, 
    show_campaign: null, 
    embed_info: {"exclusive_embeddable":null,"public_embeddable":"01 Dec 2011 06:09:19 GMT","no_track_preorder":false,"item_public":true}
};

var FanData = {
    logged_in: false,
    name: null,
    image_id: null,
    ip_country_code: null
};

var TralbumData = {
    current: {"require_email_0":1,"new_date":"18 Jan 2017 22:59:06 GMT"},
    is_preorder: null,
    album_is_preorder: null,
    album_release_date: "01 Dec 2017 00:00:00 GMT",
    preorder_count: null,
    hasAudio: true,
    art_id: 3862222,
    trackinfo: [{"video_featured":null,"has_lyrics":false,"file":{"mp3-128":"https://t4.bcbits.com/stream/064bc3d8bb5/mp3-128/35322674"},"is_capped":null,"sizeof_lyrics":0,"duration":143.244,"encodings_id":830008708},{"video_featured":null,"has_lyrics":false,"license_type":0}],
    playing_from: "album page",
    featured_track_id: 8612194,
};

Specifically, within TralbumData, I'm trying to get the URLs within mp3-128 within trackinfo.

It's tricky for me. It looks like JSON data, but I can't quite get that to work.

So far, I'm able to at least isolate trackinfo in the TralbumData variable, with a really kludgy function, but can't quite get it from there. Here's what I have to try and find trackinfo and then get the URLs within...:

def get_HTML(url):
    response = urllib.request.urlopen(url)
    page_source = response.read()
    site_html = page_source.decode('utf8')
    response.close()

    JSON = re.compile('TralbumData = ({.*?});', re.DOTALL)
    matches = JSON.search(site_html)
    info = matches.group(1)
    # print(info)

    data = info.split("\n")
    return data

def get_trackinfo(data):
    # print(data[11])
    for entry in data:
        tmp = entry.split(":")
        if tmp[0].strip() == "trackinfo":
            for ent in tmp:
                tmp = ent.split("mp3-128")
                print(tmp)

This doesn't work because splitting on : also breaks the URLs apart at the https:// part.

I'd think there's a way (and I've looked around and the answers to similar questions here on SO get close, but not quite there), to do say url = my_html['TralbumData']['trackinfo']['mp3-128'] or something.

How much variation is there in that list? Just decode that line (after trackinfo:) as JSON and extract the thing you want from the Python list?
– Martijn Pieters, Commented Dec 28, 2017 at 18:19
@MartijnPieters - I'm almost positive it's always going to have this layout. And I've been trying to get what you suggest, but am having trouble understanding how to decode the line after trackinfo as JSON. Right now, I'm futzing around with a bunch of string/list manipulation which is getting unruly and is almost certainly not the most Pythonic way.
– BruceWayne, Commented Dec 28, 2017 at 18:21
Split into lines, pick the one with trackinfo in it, split on : with str.partition(), strip, decode.
– Martijn Pieters, Commented Dec 28, 2017 at 18:22
@MartijnPieters - If I try to split data into lines with data.splitlines(), I can't because the type is incorrect: my data is a list. I've edited my OP to show you how I'm getting the HTML currently (get_HTML). I've also found that in get_trackinfo(data), if I do print(data[11]), I correctly get data starting trackinfo: [{"video_featured":null, ...) but I'm still struggling with how to parse that result. Thanks for your continued help though.
– BruceWayne, Commented Dec 28, 2017 at 18:30
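
The steps suggested in the comments above (split into lines, pick the trackinfo line, partition on the first colon, strip, decode) could look roughly like the sketch below. It assumes the captured block is still a single string (that is, before the info.split("\n") call in get_HTML), and the info sample and the extract_trackinfo name are made up for illustration, with the track entries abbreviated from the question's HTML.

```python
import json

# Hypothetical, abbreviated stand-in for the text captured by the TralbumData regex.
info = '''{
    art_id: 3862222,
    trackinfo: [{"has_lyrics":false,"file":{"mp3-128":"https://t4.bcbits.com/stream/064bc3d8bb5/mp3-128/35322674"}},{"has_lyrics":false,"license_type":0}],
    playing_from: "album page",
}'''

def extract_trackinfo(block):
    """Return the decoded trackinfo list, or None if no such line exists."""
    for line in block.splitlines():
        # partition() splits on the first colon only, so "https://" stays intact.
        key, sep, value = line.partition(":")
        if sep and key.strip() == "trackinfo":
            # Strip whitespace and the trailing comma, then decode as JSON.
            return json.loads(value.strip().rstrip(","))
    return None

tracks = extract_trackinfo(info)
# Not every entry has a "file" key, so guard the lookup.
urls = [t["file"]["mp3-128"] for t in tracks if t.get("file")]
print(urls)
```

Because the value after trackinfo: is already valid JSON in the sample, no manual string surgery on the URLs is needed.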

ekhumoro · Accepted Answer · 2017-12-29 17:45:20Z

1

Here is a relatively straightforward solution using json:

import re, json, pprint, urllib.request

regex_data = re.compile(r"""
    ^\s*var\s+TralbumData\s*=\s*\{(.*?)^\};
    """, re.DOTALL | re.MULTILINE | re.VERBOSE)

regex_item = re.compile(r"""
    ^\s*([\'"]?)([a-z][a-z0-9_]*)\1\s*:\s*(.+?)\s*,?\s*$
    """, re.IGNORECASE | re.VERBOSE)

def scrape(url):
    result = {}
    response = urllib.request.urlopen(url)
    html = response.read().decode('utf8')
    response.close()
    match = regex_data.search(html)
    if match is not None:
        for line in match.group(0).splitlines():
            match = regex_item.match(line)
            if match is None:
                continue
            key, value = match.group(2, 3)
            try:
                result[key] = json.loads(value)
            except json.JSONDecodeError:
                pass
    return result

tralbumdata = scrape('https://studiomdhr.bandcamp.com/releases')

pprint.pprint(tralbumdata)

This assumes that the TralbumData object in the JavaScript code has each of its top-level key:value items on a separate line. It also assumes that all lower-level JavaScript objects have double-quoted string keys, as this is required by the JSON format. (Note that lines ending in a comment cannot be parsed, because JSON doesn't support comments at all.)
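
To illustrate those assumptions, here is a small sketch of which values json.loads accepts. The first two samples are copied from the HTML in the question; the third (a value followed by a comment) is a hypothetical line, and try_decode is just a helper name chosen for this example.

```python
import json

samples = {
    # Nested object with quoted keys (from TralbumData.current): valid JSON.
    "quoted keys": '{"require_email_0":1,"new_date":"18 Jan 2017 22:59:06 GMT"}',
    # Nested object with bare keys (from EmbedData.tralbum_param): not valid JSON.
    "bare keys": '{ name: "album", value: 28473799 }',
    # Hypothetical value followed by a JavaScript comment: not valid JSON.
    "trailing comment": 'true // has audio',
}

def try_decode(text):
    """Return (True, value) if text is valid JSON, else (False, None)."""
    try:
        return True, json.loads(text)
    except json.JSONDecodeError:
        return False, None

results = {label: try_decode(text)[0] for label, text in samples.items()}
print(results)
```

With the scrape function above, items like the last two would simply be skipped by the except clause rather than stored in the result.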

edited Dec 29, 2017 at 17:45

answered Dec 28, 2017 at 20:48

ekhumoro


6 Comments

BruceWayne Over a year ago

Woah holy cow, thanks! This works and I can follow most of it. Does Match() find all instances of an item? I.e. if there were say 5 instances of trackinfo or TralbumData, would match return 4 (I assume it starts with 0) and I could loop through the matches via group()?

BruceWayne Over a year ago

Also, when you have a moment, do you mind breaking down the regex match? I'm trying to make sense of it but am still learning there too.

BruceWayne Over a year ago

Does the match = regex.search(html) for some reason stop after the trackinfo section? I'm also trying just to pull featured_track_id from the JSON but can't quite figure out how. I added an if re.match(r'featured_track_id', line) type part too, but that doesn't seem to grab it. How would I get a value of one of the "top level" variables?

ekhumoro Over a year ago

@BruceWayne. It makes no sense to search for more than one trackinfo or TralbumData, because variables and keys cannot be duplicated. I have now updated the script to parse the entire TralbumData object, which is returned as a Python dict. The regexes are fairly simple. They mostly just allow for variable amounts of whitespace. The item regex also allows for optional quotes around the keys (the \1 is a back-reference to whatever is matched by the first group).

ekhumoro Over a year ago

@BruceWayne. info['trackinfo'][0]['file'], etc. It has exactly the same structure as the JavaScript object.
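
As a rough illustration of the back-reference described in the comments above, the sketch below runs the answer's item regex against three lines. The test lines are made up for this example; only the pattern itself comes from the answer.

```python
import re

# Same pattern as regex_item in the answer.
regex_item = re.compile(r"""
    ^\s*([\'"]?)([a-z][a-z0-9_]*)\1\s*:\s*(.+?)\s*,?\s*$
    """, re.IGNORECASE | re.VERBOSE)

lines = [
    '    art_id: 3862222,',      # bare key: group 1 is empty, so \1 matches nothing
    '    "art_id": 3862222,',    # quoted key: \1 must match the same quote character
    "    'art_id\": 3862222,",   # mismatched quotes: \1 cannot match, so no match at all
]

parsed = []
for line in lines:
    m = regex_item.match(line)
    # group(2) is the key, group(3) the raw value with the trailing comma removed.
    parsed.append(m.group(2, 3) if m else None)

print(parsed)
```

The value in group 3 is still a string at this point; it only becomes a Python object once it is passed to json.loads.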


Axalix · Accepted Answer · 2017-12-28 19:00:12Z

1

Here's my solution:

1. The get_var function does the initial parsing so that you can then use the JSON functions.
2. Apply json.loads(var) and get access to the JSON elements.

import re
import json

text = """
<script type="text/javascript">

var BandData = {
    id: 171185318,
    name: "MASS",
    fan_email: null,
    account_id: 365569831,
    has_discounts: null,
    image_id: 39000212
};

var EmbedData = {
    tralbum_param: { name: "album", value: 28473799 }, 
    show_campaign: null, 
    embed_info: {"exclusive_embeddable":null,"public_embeddable":"01 Dec 2011 06:09:19 GMT","no_track_preorder":false,"item_public":true}
};

var FanData = {
    logged_in: false,
    name: null,
    image_id: null,
    ip_country_code: null
};

var TralbumData = {
    current: {"require_email_0":1,"new_date":"18 Jan 2017 22:59:06 GMT"},
    is_preorder: null,
    album_is_preorder: null,
    album_release_date: "01 Dec 2017 00:00:00 GMT",
    preorder_count: null,
    hasAudio: true,
    art_id: 3862222,
    trackinfo: [{"video_featured":null,"has_lyrics":false,"file":{"mp3-128":"https://t4.bcbits.com/stream/064bc3d8bb5/mp3-128/35322674"},"is_capped":null,"sizeof_lyrics":0,"duration":143.244,"encodings_id":830008708},{"video_featured":null,"has_lyrics":false,"license_type":0}],
    playing_from: "album page",
    featured_track_id: 8612194,
};
"""

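def get_var(text, var):
    """
    Extract a "var NAME = {...};" block from text and parse it as JSON.

    :type text: str
    :type var: str
    :rtype: dict
    """
    # Raw strings so that \s reaches the regex engine unchanged.
    pattern = r'var\s+' + re.escape(var.strip()) + r'\s*=\s*\{'
    open_token_found = False
    block = '{'
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if open_token_found:
            if line.startswith('};'):
                block += '}'
                break
            # Split on the first colon only, so URLs in the value survive.
            key, sep, value = line.partition(':')
            if not sep:
                continue  # not a key: value line
            key = key.strip()
            # JSON requires double-quoted keys.
            if not key.startswith('"'):
                key = '"' + key
            if not key.endswith('"'):
                key = key + '"'
            block += key + ':' + value
        elif re.match(pattern, line):
            open_token_found = True

    # Drop the trailing comma before the closing brace, which JSON forbids.
    if block.endswith(',}'):
        block = block[:-2] + '}'
    return json.loads(block)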


var = get_var(text, 'TralbumData')
print(var['trackinfo'][0]['file']['mp3-128'])

Output:

https://t4.bcbits.com/stream/064bc3d8bb5/mp3-128/35322674

edited Dec 28, 2017 at 19:00

answered Dec 28, 2017 at 18:48

Axalix


4 Comments

BruceWayne Over a year ago

I get an error, json.decoder.JSONDecodeError: Expecting ',' delimiter: line 1 column 32 (char 31), with the return json.loads(block) line highlighted.

Axalix Over a year ago

@BruceWayne I added extended part from the top where I import libraries.

BruceWayne Over a year ago

That's odd, I already have those two libraries importing at the top. Still getting that error though. Does it matter that I'm importing a pages entire html, not just the snippet I provided? (Trying your way, I tweak get_HTML to end with return site_html.)

BruceWayne Over a year ago

From looking around, it looks like this error is because it's not in the appropriate JSON format when I get to .loads()?
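
One way to check that guess is to look at the text around the position reported in the error. The sketch below is only a debugging aid, not part of either answer: explain_json_error is a hypothetical helper, and the sample strings are invented to trigger a missing-comma error.

```python
import json

def explain_json_error(block, width=20):
    """Return None if block is valid JSON, else a short description of the failure."""
    try:
        json.loads(block)
    except json.JSONDecodeError as err:
        # err.pos is the index in the input where decoding stopped.
        start = max(err.pos - width, 0)
        snippet = block[start:err.pos + width]
        return f"{err.msg} at char {err.pos}: {snippet!r}"
    return None

# Hypothetical inputs: the first is missing a comma between two members.
print(explain_json_error('{"a": 1 "b": 2}'))
print(explain_json_error('{"a": 1, "b": 2}'))
```

Calling it on the block string just before json.loads(block) in get_var would show which part of the real page deviates from JSON.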
