Banging my head here..

I am trying to use a regex to extract, from the HTML source, the entire contents of the JavaScript variable 'ListData', which starts with the declaration var ListData = and ends with };.

I found a solution which is similar:

Fetch data of variables inside script tag in Python or Content added from js

But I am unable to get the regex to match the entire object.

Code:

import re

# Need the ListData object
pat = re.compile(r'var ListData = (.*?);')

string = """QuickLaunchMenu == null) QuickLaunchMenu = $create(UI.AspMenu, 
null, null, null, $get('QuickLaunchMenu')); } ExecuteOrDelayUntilScriptLoaded(QuickLaunchMenu, 'Core.js');
var ListData = { "Row" : 
[{
"ID": "159",
"PermMask": "0x1b03cc312ef",
"FSObjType": "0",
"ContentType": "Item"
};
moretext;
moretext"""

# Prints <class 'NoneType'> instead of a match object
print(type(pat.search(string)))

Not sure what is going wrong here. Any help would be appreciated.

  • You should add the right-hand delimiter to the pattern, try '(?m)var ListData = (.*?)};$' Commented Nov 16, 2018 at 16:25
  • @WiktorStribiżew This doesn't match either. Commented Nov 16, 2018 at 16:35
  • Sure, there are multiple lines, use '(?sm)var ListData = (.*?)};$' Commented Nov 16, 2018 at 16:38
  • @WiktorStribiżew Yup, that works. If you want to write up an official answer I'll accept it. I will have to read up on what (?sm) does. Thanks! Commented Nov 16, 2018 at 16:48
  • Note that this pattern only relies on the fact that the trailing }; appears at the end of the line. Is it really like that? The .*? matches any 0+ chars, as few as possible, up to the first }; that is at the end of the line. Please add those details to the question as it is an important restriction. Commented Nov 16, 2018 at 16:50

1 Answer

In your regex, the (.*?); part matches any 0+ chars other than line break chars up to the first ;. Since . does not match line breaks by default and there is no ; on the line where var ListData = appears, you get no match.
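To see the effect in isolation, here is a minimal sketch. The text value is a shortened, hypothetical stand-in for the page source in the question, not the real page.

```python
import re

# Shortened, hypothetical stand-in for the page source in the question.
text = 'var ListData = { "Row" :\n[{\n"ID": "159"\n};\nmoretext;'

# Without re.S, "." stops at the line break, so no ";" can be reached
# from the line that holds "var ListData = ".
no_dotall = re.search(r'var ListData = (.*?);', text)

# With re.S, "." also matches line breaks and the lazy group runs
# up to the first ";" anywhere after the declaration.
with_dotall = re.search(r'var ListData = (.*?);', text, flags=re.S)

print(no_dotall)               # None
print(with_dotall.group(1))
```

Note that stopping at the first ; is only safe if no string value inside the object contains a semicolon, which is an assumption about the data.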

Given that your expected match ends with the first }; at the end of a line, you may use

'(?sm)var ListData = (.*?)};$'

Here,

  • (?sm) - enables re.S (makes . match any char, including line break chars) and re.M (makes $ match at the end of each line, not just at the end of the whole string, and ^ match at the start of each line) modes
  • var ListData =
  • (.*?) - Group 1: any 0+ chars, as few as possible, up to the first...
  • };$ - }; at the end of a line
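As a rough sketch of the pattern in use, assuming a shortened, hypothetical version of the page source (the html, body and list_data names are illustrative only):

```python
import re

# Shortened, hypothetical stand-in for the page source from the question.
html = (
    "ExecuteOrDelayUntilScriptLoaded(QuickLaunchMenu, 'Core.js');\n"
    'var ListData = { "Row" :\n'
    '[{\n'
    '"ID": "159",\n'
    '"ContentType": "Item"\n'
    '};\n'
    'moretext;\n'
)

pat = re.compile(r'(?sm)var ListData = (.*?)};$')
m = pat.search(html)

# Group 1 stops just before the closing "};", so the final brace is not captured.
body = m.group(1)

# Re-append the brace if the whole object literal is wanted.
list_data = body + "}"
print(list_data)
```

Because the } sits outside the capturing group, Group 1 holds everything up to, but not including, the closing brace. Moving the parenthesis, as in (.*?}), would be one way to capture it directly.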

10 Comments

Python has a better way to represent multiline matches: pat = re.compile(r"var ListData = (.*?})?", re.MULTILINE). In this case you also need re.DOTALL, since the dot has to match a newline, so pat = re.compile(r"var ListData = (.*?})?", re.MULTILINE | re.DOTALL)
@AdamSmith Your r"var ListData = (.*?})?" does not require the re.M flag since it has neither $ nor ^. re.S is the same as re.DOTALL, and I prefer using inline variants; they are shorter and more universal.
They are shorter, and more universal (inasmuch as the syntax I'm proposing couldn't be less universal since it applies only to Python), but I feel like inline flags muddy the waters a bit in a language that already quickly devolves to gobbledygook. YMMV :)
@AdamSmith I also prefer that way in Python because there are often cases like stackoverflow.com/questions/11958728 or stackoverflow.com/questions/42581
Notably: defining it as an explicit keyword argument is unambiguous. re.compile(somepattern, flags=someflags)
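For comparison, a minimal sketch of the answer's pattern with the flags passed through the flags keyword instead of inline. The sample string is a hypothetical, shortened input.

```python
import re

# Same pattern as in the answer, with the flags passed explicitly by keyword.
pat = re.compile(r'var ListData = (.*?)};$', flags=re.DOTALL | re.MULTILINE)

# Hypothetical, shortened input.
sample = 'var ListData = { "Row" :\n[{ "ID": "159" }]\n};\nmoretext;'

m = pat.search(sample)
print(m.group(1))
```

The keyword form matters most with functions such as re.sub, where the fourth positional parameter is count rather than flags, so a flag passed positionally is silently treated as a replacement limit.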