I am looking for a way to find JSON data in a string. Think of it like WordPress shortcodes. I figure the best way to do it would be a regular expression. I do not want to parse the JSON, just find all occurrences.

Is there a way in regex to have matching numbers of parentheses? Currently I run into that problem when having nested objects.

Quick example for demonstration:

This is a funny text about stuff,
look at this product {"action":"product","options":{...}}.
More Text is to come and another JSON string
{"action":"review","options":{...}}

As a result, I would like to get the two JSON strings. Thanks!

6
  • See this question Regex to validate JSON. Commented Feb 24, 2014 at 17:30
  • I think the bigger problem here is why do you have JSON strings embedded in a plain text block? I think improving the design may be a better way to go here than trying to build a regex to find JSON substrings in the wild. Commented Feb 24, 2014 at 17:30
  • Is there a way in regex to have matching numbers of parentheses? -> No. Regex is not made for that. Why don't you use json_decode and parse the result array for data you need? Commented Feb 24, 2014 at 17:30
  • You realize that 42 would be valid JSON? As would "Hi There!"? Unless you restrict your json to be an encoded object only, it's pretty much impossible to detect ALL valid json forms. Commented Feb 24, 2014 at 17:32
  • I want to use these JSON objects as shortcodes, like in WordPress. The WordPress implementation is messy, at least that is what I think. To pass data to the functions I want to run, I figured JSON would be the best way. As a workaround I could do something like [[{json}]] and just match [[...]]. However, I want to make it as simple as possible. Commented Feb 24, 2014 at 17:33

5 Answers 5

80

Extracting the JSON string from given text

Since you're looking for a simple solution, you can use the following regular expression, which uses recursion to match balanced pairs of braces. It matches everything between { and }, recursing into nested pairs.

Note, however, that this isn't guaranteed to work in all cases: for example, a brace inside a string value, as in {"text":"abc { def"}, throws off the balance. It only serves as a quick JSON-string extraction method.

$pattern = '
/
\{              # { character
    (?:         # non-capturing group
        [^{}]   # anything that is not a { or }
        |       # OR
        (?R)    # recurses the entire pattern
    )*          # previous group zero or more times
\}              # } character
/x
';

preg_match_all($pattern, $text, $matches);
print_r($matches[0]);

Output:

Array
(
    [0] => {"action":"product","options":{...}}
    [1] => {"action":"review","options":{...}}
)

Regex101 Demo
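
For engines without recursion support, the balanced-brace idea can also be sketched without a regex. The following is a minimal Python sketch, not part of the answer above: find_brace_spans is a hypothetical helper that counts brace depth and, as an added assumption beyond what the regex does, ignores braces inside double-quoted strings.

```python
def find_brace_spans(text):
    """Return every top-level {...} substring whose braces balance."""
    spans = []
    depth = 0          # current brace nesting level
    start = None       # index of the opening brace of the current candidate
    in_string = False  # True while inside a double-quoted string at depth > 0
    escaped = False    # True right after a backslash inside a string

    for i, ch in enumerate(text):
        if in_string:
            if escaped:
                escaped = False      # this character is escaped, skip it
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False    # closing quote
            continue
        if ch == '"' and depth > 0:
            in_string = True         # only track strings inside a candidate
        elif ch == "{":
            if depth == 0:
                start = i
            depth += 1
        elif ch == "}" and depth > 0:
            depth -= 1
            if depth == 0:
                spans.append(text[start:i + 1])
    return spans


text = (
    'This is a funny text about stuff,\n'
    'look at this product {"action":"product","options":{...}}.\n'
    'More Text is to come and another JSON string\n'
    '{"action":"review","options":{...}}'
)
for span in find_brace_spans(text):
    print(span)
```

Like the regex, this only locates candidates and does not validate them. Unlike the regex, a stray unmatched { earlier in the text (or an unterminated string inside a candidate) hides everything after it, so this is a trade-off rather than a strict improvement.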


Validating the JSON strings

In PHP, the reliable way to know whether a JSON string is valid is to pass it to json_decode() (PHP 8.3 and later also provide json_validate()). If the parser understands the string and it conforms to the JSON standard, json_decode() creates an object / array representation of it; otherwise json_last_error() reports the failure.

If you'd like to filter out those that aren't valid JSON, then you can use array_filter() with a callback function:

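function isValidJSON($string) {
    // json_decode() returns null both for invalid input and for the literal "null",
    // so check the error state instead of the return value.
    json_decode($string);
    return json_last_error() === JSON_ERROR_NONE;
}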

$valid_jsons_arr = array_filter($matches[0], 'isValidJSON');

Online demo
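
The same filter-by-decoding idea carries over to other languages. As a rough Python sketch (is_valid_json and the sample candidates are made up for illustration; json.loads plays the role of json_decode()):

```python
import json


def is_valid_json(candidate):
    """Return True if the string decodes as JSON, False otherwise."""
    try:
        json.loads(candidate)
    except json.JSONDecodeError:
        return False
    return True


# Hypothetical candidates, e.g. the output of a brace-matching step.
candidates = [
    '{"action":"product","options":{"id":1}}',
    '{"action":"review","options":{...}}',   # "..." is not valid JSON
    '{broken',
]
valid = [c for c in candidates if is_valid_json(c)]
print(valid)
```

Only the first candidate survives the filter; the literal {...} placeholder from the question is rejected because it is not valid JSON.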


3 Comments

since Java had no recursive steps I just used the pattern 3 times: \{(?:[^{}]|(\{(?:[^{}]|(\{[^{}]*\}))*\}))*\} was sufficient for me... but hacky...
One case where this will not work is when you have curly braces inside a string. e.g: {"text":"abc { def"}
it has problem if the string is too long. But, I found out that adding ThreadStackSize on apache httpd.conf will solve the issue. My question is, what is ThreadStackSize for?
7

For JavaScript folks looking for a similar regex: the recursive pattern (?R) is not supported by JavaScript, Python's built-in re module, and several other engines.

Note: this is not a one-to-one replacement, because the result only matches up to a fixed nesting depth.

 \{(?:[^{}]|(?R))*\} # PCRE Supported Regex

Steps:

  1. Copy the whole regex and replace ?R with the copied string. For example:
  • level 1 json => \{(?:[^{}]|(?R))*\} => \{(?:[^{}]|())*\}
  • level 2 json => \{(?:[^{}]|(\{(?:[^{}]|(?R))*\}))*\} => \{(?:[^{}]|(\{(?:[^{}]|())*\}))*\}
  • level n json => \{(?:[^{}]|(?<n times>))*\}
  2. When you decide to stop at some level, replace ?R with an empty string.

Done.
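
The steps above can be sketched as a small Python helper. This is an illustration under assumptions, not the answer's exact output: unroll is a hypothetical name, the copies are wrapped in non-capturing groups so that re.findall returns whole matches, and at the last level the "|(?R)" alternative is dropped entirely instead of being left as an empty group.

```python
import re

# The recursive PCRE pattern from above; Python's re cannot run it as-is.
BASE = r"\{(?:[^{}]|(?R))*\}"


def unroll(levels):
    """Build a non-recursive pattern that matches up to `levels` brace levels."""
    if levels < 1:
        raise ValueError("levels must be at least 1")
    pattern = BASE
    for _ in range(levels - 1):
        # Step 1: replace the recursion marker with a copy of the whole regex.
        pattern = pattern.replace("(?R)", "(?:" + BASE + ")")
    # Step 2: stop recursing by removing the last remaining marker.
    return pattern.replace("|(?R)", "")


text = (
    'look at this product {"action":"product","options":{...}}.\n'
    'More Text is to come and another JSON string\n'
    '{"action":"review","options":{...}}'
)
print(unroll(2))
print(re.findall(unroll(2), text))
```

With two levels the pattern finds both objects from the question. Input nested deeper than the chosen level is only matched partially (the innermost levels that fit), which is why the depth has to be picked for the data at hand.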


5

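I would add a * after the character class, so that a whole run of non-brace characters is consumed in one step while (?R) still handles the nested objects: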

{(?:[^{}]*|(?R))*}

Check it Demo


2

Update 10/5/2025: Final version of the functions, with a new error detection scheme.
Note that for the (*ACCEPT) verb to work correctly
for error detection, a central recursion function must be defined
that contains the other recursion function definitions embedded
within it. All (*ACCEPT) calls are made from this function.
Thus the function (?<Er_Obj> contains (?<Er_Ary> .. ); see the regex and comments.
I am not sure why it has to be this way, or why calling (*ACCEPT) from standalone
parallel functions does not guarantee success.
The structure below works 100% and is fully tested.


For additional info on how this regex works see :
https://stackoverflow.com/a/79785886/15577665


These regex functions will validate as well as find errors in JSON strings.
Their granularity is such that any particular item at any particular level
can be found, removed, modified or replaced without any other needed help.

There are 2 main groups of functions: Validation and Error parsing.
The error parsing matches up to, and stops exactly at, the place where
the error is.

So if a JSON string is not valid, it can be examined using the error functions to identify
what the error is and where it is. This is accomplished with the (*ACCEPT) verb.

These regex functions can be used to write a JSON query app without much effort.

Function categories:

  • Common : (?&Sep_Ary), (?&Sep_Obj), (?&Str), (?&Numb)
  • Validation : (?&V_KeyVal), (?&V_Value), (?&V_Ary), (?&V_Obj)
  • Error : (?&Er_Obj), (?&Er_Ary), (?&Er_Value)

Free form Demo : https://regex101.com/r/wYoW7v/1

(?:(?:(?&V_Obj)|(?&V_Ary))|(?<Invalid>(?&Er_Obj)|(?&Er_Ary)))(?(DEFINE)(?<Sep_Ary>\s*(?:,(?!\s*[}\]])|(?=\])))(?<Sep_Obj>\s*(?:,(?!\s*[}\]])|(?=})))(?<Er_Obj>(?>{(?:\s*(?&Str)(?:\s*:(?:\s*(?:(?&Er_Value)|(?<Er_Ary>\[(?:\s*(?:(?&Er_Value)|(?&Er_Ary)|(?&Er_Obj))(?:(?&Sep_Ary)|(*ACCEPT)))*(?:\s*\]|(*ACCEPT)))|(?&Er_Obj))(?:(?&Sep_Obj)|(*ACCEPT))|(*ACCEPT))|(*ACCEPT)))*(?:\s*}|(*ACCEPT))))(?<Er_Value>(?>(?&Numb)|(?>true|false|null)|(?&Str)))(?<Str>(?>"[^\\"]*(?:\\[\s\S][^\\"]*)*"))(?<Numb>(?>[+-]?(?:\d+(?:\.\d*)?|\.\d+)(?:[eE][+-]?\d+)?|(?:[eE][+-]?\d+)))(?<V_KeyVal>(?>\s*(?&Str)\s*:\s*(?&V_Value)\s*))(?<V_Value>(?>(?&Numb)|(?>true|false|null)|(?&Str)|(?&V_Obj)|(?&V_Ary)))(?<V_Ary>\[(?>\s*(?&V_Value)(?&Sep_Ary))*\s*\])(?<V_Obj>{(?>(?&V_KeyVal)(?&Sep_Obj))*\s*}))

Regex

# ==========================================
# Validation and Error Detection ..
# ----------------------------------

# Find Valid JSON :
# https://regex101.com/r/pxC3Ph/1

# Find Errors of Failed JSON :
# https://regex101.com/r/pE0vPU/1

# (?m)
# ^ 
(?:
   (?:                                                     # Valid JSON
      (?&V_Obj) 
    | (?&V_Ary) 
   )
 |                                                        # or,
   (?<Invalid>                                             # (1), Invalid JSON - Find the error
      (?&Er_Obj) 
    | (?&Er_Ary) 
   )
)

# ------------------------------------
# JSON --  Function Types / Defines
# ------------------------------------
(?(DEFINE)
   
   # ================
   #  Separators
   # ==============
   
   (?<Sep_Ary>                                             # (2), Separator for Array
     \s*
     (?:
         , 
         (?! \s* [}\]] )
       | (?= \] )
      )
   )
   
   (?<Sep_Obj>                                             # (3), Separator for Object
      \s*
      (?:
         , 
         (?! \s* [}\]] )
       | (?= } )
      )
   )
   
   # ========================
   #  ERROR  Detection 
   # ======================
   
   (?<Er_Obj>                                              # (4), Object Error detection
      (?>
         {                                                       # Open  object brace  {
         (?:
            \s* (?&Str)                                             # Key 
            (?:                                                     # ------------------
               \s* :                                                   # :  Colon separator
               (?:                                                     # Value
                  \s* 
                  (?:
                     (?&Er_Value)                                            # Strings, nums, bool, numbers
                   |                                                        # or,
                     (?<Er_Ary>                                              # (5), Array Error detection
                        \[                                                      # Open array bracket [
                        (?:
                           \s* 
                           (?:
                              (?&Er_Value)                                            # Strings, nums, bool, numbers
                            | (?&Er_Ary)                                              # or,  arrays
                            | (?&Er_Obj)                                              # or,  objects
                           )
                           (?:                                                     # Array separator or (*ACCEPT)
                              (?&Sep_Ary) 
                            | (*ACCEPT) 
                           )
                        )*
                        (?: \s* \] | (*ACCEPT) )                                # Close array bracket ] or (*ACCEPT)
                     )
                   |                                                        # or,
                     (?&Er_Obj)                                              # Object          
                  )                                                       # ------
                  
                  (?:                                                     # Object separator or (*ACCEPT)
                     (?&Sep_Obj) 
                   | (*ACCEPT) 
                  )
                  
                | (*ACCEPT)                                               # Value error, just (*ACCEPT)
               )
             | (*ACCEPT)                                               # No Colon separator, just (*ACCEPT)
            )
         )*
         (?: \s* } | (*ACCEPT) )                                 # Close  object brace  } or (*ACCEPT)
      )
      
   )
   
   (?<Er_Value>                                            # (6), Values Error detection
      (?>
         (?&Numb)                                                # Numbers
       | (?> true | false | null )                               # Boolean and null
       | (?&Str)                                                 # String
      )
   )
   
   # ========================
   #  Strings and Numbers 
   # ======================
   
   (?<Str>                                                 # (7), String
      (?>
         " [^\\"]* 
         (?: \\ [\s\S] [^\\"]* )*
         "
      )
   )
   # if no control codes, use this :
   # " [^\x00-\x1f\\"]* 
   # (?: \\ [^\x00-\x1f] [^\x00-\x1f\\"]* )*
   # "
   
   (?<Numb>                                                # (8), Numbers
      (?>
         [+-]? 
         (?:
            \d+ 
            (?: \. \d* )?
          | \. \d+ 
         )
         (?: [eE] [+-]? \d+ )?
       | (?: [eE] [+-]? \d+ )
      )
   )
   
   # ==========================
   #  Validation Detection
   # =======================
   
   (?<V_KeyVal>                                            # (9), Validated  Key : Value Pair
      (?>
         \s* (?&Str) \s* : \s* (?&V_Value) \s*
      )
   )
   
   (?<V_Value>                                             # (10), Validated Value
      (?>
         (?&Numb)                                                # Numbers
       | (?> true | false | null )                               # Boolean and null
       | (?&Str)                                                 # String
       | (?&V_Obj)                                               # Object
       | (?&V_Ary)                                               # Array
      )
   )
   
   (?<V_Ary>                                               # (11), Validated Array
      \[ 
      (?>
         \s* (?&V_Value) (?&Sep_Ary) 
      )*
      \s* \] 
   )
   
   (?<V_Obj>                                               # (12), Validated Object
      {
      (?>
         (?&V_KeyVal) (?&Sep_Obj) 
      )*
      \s* }
   )
   
)

4 Comments

Can you please create some examples for this function? for a non regex guru it's hard to decipher what are they used for or how.
There are some references to some varied amount of samples here stackoverflow.com/a/79785886/15577665 based on this library. Everything from the (?(DEFINE) .. line down is a constant set of unchanging functions that comprise the JSON core set. Everything above that line is a usage example of how this core set can be used. The regex skill level to use this can be as a beginner. Basically all the JSON questions on SO can be answered with this if the host language supports the Perl/PCRE style engine.
Well, without a proper explanation and examples I don't consider this answer useful. This site is about self sufficient complete answers, not some discussion that references other answers.
The question was "to find JSON substrings in a string". This (?:(?:(?&V_Obj)|(?&V_Ary))|(?<Invalid>(?&Er_Obj)|(?&Er_Ary))) was the answer. But as a self proclaimed non regex guru I guess you missed that. You should have downvoted this answer before you asked for an example, which was right there the whole time. I'm sure that was your intention from the start.
0

Adding to the answers that suggest ?R for recursion: if you want to match other things in the regex as well, not just the JSON object (e.g. a JSON object preceded by a string, like key: {jsonobject}), then you want to recurse only the JSON rule:

(?<j>\{(?:[^{}]|(?&j))*\})

I am using named subpatterns in this example. Notice the ?<j> and the (?&j), which define the subpattern and reference it, respectively. With this you can match the following as an example:

  • Only match the JSON objects that are preceded by "ERROR: ":
ERROR: (?<j>\{(?:[^{}]|(?&j))*\})
ERROR: {"some": "info"}     # will match
INFO: {"some": "info"}      # won't match

See the example on regex101
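
In engines without named recursion, such as Python's built-in re module, a similar prefix match can be approximated with a fixed-depth brace pattern. The sketch below assumes at most two nesting levels; OBJ, PREFIXED and the sample log are made up for illustration.

```python
import re

# Brace pattern written out by hand for two nesting levels (an assumption;
# deeper JSON needs more levels, since re has no (?&j) style recursion).
OBJ = r"\{(?:[^{}]|\{[^{}]*\})*\}"

# Only the JSON rule is repeated; the prefix stays outside of it.
PREFIXED = re.compile(r"ERROR: (" + OBJ + ")")

log = (
    'ERROR: {"some": "info"}\n'
    'INFO: {"some": "info"}\n'
    'ERROR: {"a": {"b": 1}}'
)
found = [m.group(1) for m in PREFIXED.finditer(log)]
print(found)
```

Only the objects on the two ERROR lines are captured; the INFO line is skipped because the prefix does not match.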

