1

In one of my script I use urllib2and BeautifulSoup to parse a HTML page and read a <script> tag.

This is what I get :

<script>
var x_data = {
    logged: logged,
    lengthcarrousel: 2,
    products : [
        {
            "serial" : "106541823"
            ...
</script>

My goal is to read the JSON in the x_data variable and I do not know how to do it properly. I though of :

  • Convert to string and remove the first chars to the { and same for last }
  • Use Regular Expression with something like '{.*}' and take the first group
  • Something else ?

I don't know if these are efficient and if there is some other ways to do it in a nice way.

Do you think a method is preferable to the other ? any method I may not be aware of ?

Thank you in advance for any advice.

EDIT :

Following advice I get the Regexp solution but I can't search in multiple lines despite using re.MULTILINE :

string1 = '<script>
var x_data = {
    logged: logged,
    lengthcarrousel: 2,
    products : [
        {
                "serial" : "106541823"}
]
}; 
</script>'


p = re.compile(r'\{.*\};',re.MULTILINE);
m = p.search(string1)
if m:
    print m.group(0)
else:
    print "Error !"

I always got an "Error !".

EDIT2 :

Works well with re.DOTALL.

7
  • pypi.org/project/jsonfinder Commented Dec 1, 2016 at 16:31
  • Depends on how the input varies. If it's always going to be var x_data = ..., you can just regex replace out that bit anchored to the beginning of the string. Your solution could lie anywhere between as simple as that to as complicated as embedding a JS parser. Commented Dec 1, 2016 at 16:32
  • Hello it will always be var x_data = .... Thanks I am writting the regexp solution right now. Commented Dec 1, 2016 at 16:34
  • 1
    @JohnMath You're not supposed to use multiline matching. Just do a match anchored to the beginning of the string, you don't need to be searching everywhere inside the string. Commented Dec 1, 2016 at 17:05
  • 1
    @JohnMath gist.github.com/masaeedu/c6259e9d1e5f00be65fdeaf7b0c8bee3 Commented Dec 1, 2016 at 17:19

2 Answers 2

2

I think these methods are essentially the same in terms of elegance and performance (using {.*} may be slightly better because .* is greedy, i.e. there will be almost no backtracking, and because it seems to me more "forgiving" for different JS code formatting nuances). What you may be more interested in is this: https://docs.python.org/3.6/library/json.html.

Sign up to request clarification or add additional context in comments.

Comments

2

If it always looks exactly like this, then you can hack a solution like the one you proposed, based on it looking exactly like this.

Because programmers do everything in code, I suspect in practice it will not alway look exactly this, and then any hacky solution will be fragile and will fail at unexpected (read "impossibly inconvenient") moments. (Regex is known to be hacky when it comes to parsing code).

If you want to do this right, you will need to get a real JavaScript parser, apply it to the code fragment defined by the script tag content, to produce an AST, then search the AST for JavaScript nested structures that happen to look like JSON, and take the content of that tree, prettyprinted.

Even this will be fragile in the face of a programmer who assembles the JSON fragment using JavaScript assignment statements. You can handle this by computing data flow, and discovering sets of code that happen to assemble JSON code. This is rather a lot of work.

So you get to decide what the limits on your solution will be, and then accept the consequences when somebody you don't control does something random.

8 Comments

Wow. A downvote. Downvoter, are you willing to explain your reason for downvoting, and In particular, why this answer isn't correct?
I didn't downvote but assume that if "it will always be var x_data = ...." (stackoverflow.com/questions/40915646/…) then the code structure should be quite fixed.
That's for OP to assert and enforce to the extent he can do so. If he can really enforce this then he can hack away safely. I think it is a very bad bet. My experience with people telling me that all their code is written according to a specific style is that the people telling me aren't the ones who wrote the code, and are just lying to themselves and to me, and will make me suffer when it turns out be false. And I have a lot of experience here.
Surely you have, I remember your name because of some excellent answer here on StackOverflow, but after all one should assume something. Otherwise: what if the x_data content is, for example, to be obtained over the network? I think it is not exactly the case for parsing, despite the question title.
In fact, if you don't assume something, the problem is in general impossible; Turing tells that extracting most facts from code is equivalent to the halting problem. The trick is is assume as little as possible. Even better if you verify that your assumptions are true, or true often enough so the exceptions are manageable. (I remember taking a contract with IBM on a source code that the manager swore by company edict was C code, for which we planned to use our C parser; a million lines showed up and half of it was C++. Programmers armed with GCC don't give a damn about corporate edicts).
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.