I've used BeautifulSoup to get the snippet below from an HTML page. I'm having trouble extracting just the JSON (the value assigned to FB_DATA). I'm guessing I need to use re.search, but I'm struggling with the regex.

The snippet is:

<script type="text/javascript">
    var FB_DATA = {
        "foo": bar,
        "two": {
          "foo": bar,
        }};
    var FB_PUSH = []; 
    var FB_PULL = []; 
</script>
  • What do you have for a regex so far? Commented May 27, 2014 at 18:48
  • Honestly, I don't even know where to start. I hate posting with so little to go on, but I'm just learning and I'm not strong with regex. Commented May 27, 2014 at 18:50

2 Answers

6

I'm assuming your main issue is using .*? when, by default, . matches any character except a newline. With the s (dot-matches-newline) modifier, which is re.DOTALL in Python, this becomes straightforward:

(?s)    (?# dot-match-all modifier)
var     (?# match var literally)
\s+     (?# match 1+ whitespace)
FB_DATA (?# match FB_DATA literally)
\s*     (?# match 0+ whitespace)
=       (?# match = literally)
\s*     (?# match 0+ whitespace)
(       (?# start capture group)
 \{     (?# match { literally)
 .*?    (?# lazily match 0+ characters)
 \}     (?# match } literally)
)       (?# end capture group)
;       (?# match ; literally)
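
In Python, the pattern can stay in this annotated form if it is compiled with re.VERBOSE. The following is a minimal sketch, assuming Python 3; the names FB_DATA_RE and sample are made up for illustration, and re.DOTALL stands in for the inline (?s) modifier.

```python
import re

# Verbose form of the pattern: whitespace is ignored and "#" starts a comment.
FB_DATA_RE = re.compile(
    r"""
    var \s+ FB_DATA   # the variable declaration
    \s* = \s*         # the assignment, with optional whitespace
    ( \{ .*? \} )     # group 1: lazily match from { up to a } that is followed by ;
    ;                 # the terminating semicolon
    """,
    re.VERBOSE | re.DOTALL,
)

# Hypothetical sample input, modelled on the snippet in the question.
sample = """
    var FB_DATA = {
        "foo": 1,
        "two": {
          "foo": 2
        }};
    var FB_PUSH = [];
"""

match = FB_DATA_RE.search(sample)
captured = match.group(1) if match else None
print(captured)
```

Because the closing brace must be followed directly by a semicolon, the lazy match runs past the inner } and stops at the outer one.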

Your JSON string will be in capture group #1.

import re
m = re.search(r"(?s)var\s+FB_DATA\s*=\s*(\{.*?\});", html)  # html: the page source as a string
if m:
    print(m.group(1))
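
Once the text is captured, it still has to be parsed. Here is a minimal sketch, assuming Python 3 and that the FB_DATA value on the real page is valid JSON. The snippet in the question, with its unquoted bar and trailing comma, would not parse as posted, so the hypothetical html below uses a cleaned-up variant. extract_fb_data is an illustrative helper name, not part of any library.

```python
import json
import re


def extract_fb_data(html):
    """Return the FB_DATA object as a dict, or None if it is not found."""
    m = re.search(r"(?s)var\s+FB_DATA\s*=\s*(\{.*?\});", html)
    if m is None:
        return None
    # json.loads raises json.JSONDecodeError (a ValueError subclass)
    # if the captured text is not valid JSON.
    return json.loads(m.group(1))


# Hypothetical page fragment with valid JSON values.
html = """<script type="text/javascript">
    var FB_DATA = {
        "foo": "bar",
        "two": {
          "foo": "bar"
        }};
    var FB_PUSH = [];
    var FB_PULL = [];
</script>"""

data = extract_fb_data(html)
print(data["two"]["foo"])
```

Returning None when nothing matches keeps the "not found" case separate from the "found but malformed" case, which surfaces as an exception.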

0

Start with

FB_DATA = (\{[^;]*;)

and see in which cases it's not enough. Note that the capture group includes the trailing semicolon, so strip it before parsing.
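
One case where this pattern falls short is a semicolon inside a string value. The sketch below (Python 3, with made-up sample strings) runs the pattern on a simple input and on one that contains such a semicolon.

```python
import re

# The pattern from this answer; group 1 includes the trailing semicolon.
pattern = re.compile(r"FB_DATA = (\{[^;]*;)")

# Hypothetical inputs for illustration only.
simple = 'var FB_DATA = {"a": {"b": 1}};\nvar FB_PUSH = [];'
tricky = 'var FB_DATA = {"a": "x; y"};\nvar FB_PUSH = [];'

simple_match = pattern.search(simple).group(1)
tricky_match = pattern.search(tricky).group(1)

print(simple_match)  # the whole object plus the trailing ";"
print(tricky_match)  # cut off at the ";" inside the string value
```

The pattern also expects exactly one space on each side of the equals sign, so differently spaced input would need \s* there instead.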
