In one of my script I use urllib2and BeautifulSoup to parse a HTML page and read a <script> tag.
This is what I get :
<script>
var x_data = {
logged: logged,
lengthcarrousel: 2,
products : [
{
"serial" : "106541823"
...
</script>
My goal is to read the JSON in the x_data variable and I do not know how to do it properly.
I though of :
- Convert to string and remove the first chars to the { and same for last }
- Use Regular Expression with something like '{.*}' and take the first group
- Something else ?
I don't know if these are efficient and if there is some other ways to do it in a nice way.
Do you think a method is preferable to the other ? any method I may not be aware of ?
Thank you in advance for any advice.
EDIT :
Following advice I get the Regexp solution but I can't search in multiple lines despite using re.MULTILINE :
string1 = '<script>
var x_data = {
logged: logged,
lengthcarrousel: 2,
products : [
{
"serial" : "106541823"}
]
};
</script>'
p = re.compile(r'\{.*\};',re.MULTILINE);
m = p.search(string1)
if m:
print m.group(0)
else:
print "Error !"
I always got an "Error !".
EDIT2 :
Works well with re.DOTALL.
var x_data = ..., you can just regex replace out that bit anchored to the beginning of the string. Your solution could lie anywhere between as simple as that to as complicated as embedding a JS parser.var x_data = .... Thanks I am writting the regexp solution right now.