4

I'm trying to use a PCRE regular expression to extract some JSON. I'm using a version of MariaDB which does not have JSON functions but does have REGEX functions.

My string is:

{"device_types":["smartphone"],"isps":["a","B"],"network_types":[],"countries":[],"category":["Jebb","Bush"],"carriers":[],"exclude_carriers":[]}

I want to grab the contents of category. I'd like a matching group that contains 2 items, Jebb and Bush (or however many items are in the array).

I've tried this pattern but it only matches the first occurrence: /(?<=category":\[).([^"]*).*?(?=\])/g

4
  • 4
    One wonders why you're pushing JSON to the DB if you need access to some of the underlying contents within the DB itself? Why not push the data you actually need? Commented Mar 30, 2016 at 11:48
  • 3
    Perl, PHP, JS, etc, etc, have routines for parsing JSON. Do it in application code. Commented Mar 30, 2016 at 23:48
  • I'm accepting ClasG's answer because in MariaDB I need one match with capturing groups rather than multiple matches. ClasG: regex101.com/r/jD1rN6/1 Redu: regex101.com/r/rU6nK8/1 Commented Mar 31, 2016 at 14:46
  • Any item in an array can be a string, number, bool, null, object or array. What do you mean by "(or however many items are in the array)"? Do you want only the string values? Commented Oct 13 at 20:24

9 Answers

4

Does this match your needs? It should match the category array regardless of its size.

"category":(\[.*?\])

regex101 example
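
To get from the captured array text to the individual values, a second pass over the capture is one option. The sketch below is JavaScript for illustration only, not MariaDB SQL. The variable names are made up, and it assumes flat string items with no escaped quotes.

```javascript
// Sample document from the question.
const json = '{"device_types":["smartphone"],"isps":["a","B"],"network_types":[],"countries":[],"category":["Jebb","Bush"],"carriers":[],"exclude_carriers":[]}';

// Step 1: capture the whole array, bracket to bracket (the pattern above).
const arrayMatch = json.match(/"category":(\[.*?\])/);
const arrayText = arrayMatch ? arrayMatch[1] : '[]';

// Step 2: pull each quoted item out of the captured text.
const categories = Array.from(arrayText.matchAll(/"([^"]*)"/g), m => m[1]);

console.log(arrayText);   // ["Jebb","Bush"]
console.log(categories);  // [ 'Jebb', 'Bush' ]
```

The lazy `.*?` stops at the first `]`, so this only holds while the array contains no nested arrays and no `]` inside a string.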


3 Comments

The OP's regex matched the entire array too, but only pulled out the first value. Yours doesn't even pull out the first value.
@JamesThorpe It pulls out the entire array, bracket to bracket.
But it doesn't pull any values out of it, which is what the OP wants.
3

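Using a set of non-capturing groups, you can extract a known JSON array.

Regex: (?:\"category\":)(?:\[)(.*?)(?:\"\])

The whole expression matches "category":["Jebb","Bush"], and the first capturing group holds the array contents without the final closing quote ("Jebb","Bush), so read group 1 and split it. The quantifier is lazy (.*?) so the match stops at the first "] instead of running on to a later non-empty array. Sample Java code:

// Requires java.util.regex.Pattern and java.util.regex.Matcher; assertThat/is are Hamcrest-style matchers.
Pattern pattern = Pattern.compile("(?:\"category\":)(?:\\[)(.*?)(?:\"\\])");
String body = "{\"device_types\":[\"smartphone\"],\"isps\":[\"a\",\"B\"],\"network_types\":[],\"countries\":[],\"category\":[\"Jebb\",\"Bush\"],\"carriers\":[],\"exclude_carriers\":[]}";
Matcher matcher = pattern.matcher(body);
assertThat(matcher.find(), is(true));
// Group 1 is "Jebb","Bush (no trailing quote): strip the quotes, then split on commas.
String[] categories = matcher.group(1).replaceAll("\"","").split(",");

assertThat(categories.length, is(2));
assertThat(categories[0], is("Jebb"));
assertThat(categories[1], is("Bush"));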


2

JSON is not a regular language. Since it allows arbitrary nesting of balanced
delimiters, it must be at least context-free.

For example, consider an array of arrays of arrays:

[ [ [ 1, 2], [2, 3] ] , [ [ 3, 4], [ 4, 5] ] ]
Clearly you couldn't parse that with true regular expressions.
See this topic, which may be helpful: Regex for parsing single key: values out of JSON in Javascript
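
A small illustration of the nesting problem, as a JavaScript sketch with a made-up input string: a lazy bracket-to-bracket pattern truncates a nested array, while a real parser (here `JSON.parse`, which is of course not available inside MariaDB) returns the full structure.

```javascript
// Hypothetical input where "category" holds nested arrays.
const nested = '{"category":[["Jebb","Bush"],["Donald"]],"carriers":[]}';

// The lazy pattern stops at the first "]", so the capture is cut short.
const truncated = nested.match(/"category":(\[.*?\])/)[1];

// A parser tracks nesting depth and recovers the whole value.
const parsed = JSON.parse(nested).category;

console.log(truncated);               // [["Jebb","Bush"]
console.log(JSON.stringify(parsed));  // [["Jebb","Bush"],["Donald"]]
```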


0

If the number of items in the array is limited (and manageable), you could define it with a finite number of optional items. Like this one with a maximum of 5 items:

"category":\["([^"]*)"(?:,"([^"]*)"(?:,"([^"]*)"(?:,"([^"]*)"(?:,"([^"]*)")?)?)?)?

regex101 example here.

Regards.
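
If the maximum changes, the nested optional groups can be generated rather than typed by hand. A rough JavaScript sketch follows; `buildCategoryPattern` is a hypothetical helper, and it assumes string items without escaped quotes. With `maxItems = 5` it builds the same pattern as above.

```javascript
// Hypothetical helper: build the nested-optional pattern for up to maxItems items.
function buildCategoryPattern(maxItems) {
  const item = '"([^"]*)"';
  let tail = '';
  // Wrap from the innermost optional group outwards.
  for (let i = 1; i < maxItems; i++) {
    tail = `(?:,${item}${tail})?`;
  }
  return new RegExp(`"category":\\[${item}${tail}`);
}

const json = '{"device_types":["smartphone"],"isps":["a","B"],"network_types":[],"countries":[],"category":["Jebb","Bush"],"carriers":[],"exclude_carriers":[]}';

// One match, one capturing group per item; unused groups come back undefined.
const m = json.match(buildCategoryPattern(5));
const items = m ? m.slice(1).filter(g => g !== undefined) : [];

console.log(items); // [ 'Jebb', 'Bush' ]
```

Because the first item is mandatory in this pattern, an empty array produces no match at all, which the sketch reports as an empty list.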


0

There are many ways. One sloppy way to do it is /([A-Z])\w+/g

Try it in your console:

var data = '{"device_types":["smartphone"],"isps":["a","B"],"network_types":[],"countries":[],"category":["Jebb","Bush"],"carriers":[],"exclude_carriers":[]}',
     res = [];
data.match(/([A-Z])\w+/g); // ["Jebb", "Bush"]

OK, the above was pretty sloppy. However, a solid single-regex solution that extracts every element one by one, regardless of how many there are, and places them in an array (res) is the following:

var rex = /[",]+(\w*)(?=[",\w]*"],"carriers)/g,
    str = '{"device_types":["smartphone"],"isps":["a","B"],"network_types":[],"countries":[],"category":["Jebb","Bush","Donald","Trump"],"carriers":[],"exclude_carriers":[]}',
    arr = [],
    res = [];
while ((arr = rex.exec(str)) !== null) {
  res.push(arr[1]); // <- ["Jebb", "Bush", "Donald", "Trump"]
}

Check it out @ http://regexr.com/3d4ee

OK, let's do it. I have come up with a devilish idea. If JS had lookbehinds, this could have been done simply by reversing the logic of the previous example, where I used a lookahead. Alas, it doesn't... So I decided to turn the world the other way around. Check this out.

String.prototype.reverse = function(){
                             return this.split("").reverse().join("");
                           };
var rex = /[",]+(\w*)(?=[",\w]*"\[:"yrogetac)/g,
    str = '{"device_types":["smartphone"],"isps":["a","B"],"network_types":[],"countries":[],"category":["Jebb","Bush","Donald","Trump"],"carriers":[],"exclude_carriers":[]}',
    rev = str.reverse(),
    arr = [],
    res = [];
    while ((arr = rex.exec(rev)) !== null) {
      res.push(arr[1].reverse()); // <- ["Trump", "Donald", "Bush", "Jebb"]
    }
res.reverse(); // <- ["Jebb", "Bush", "Donald", "Trump"]

Just use your console to confirm.
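
For comparison, engines that implement ES2018 lookbehind (which is newer than this answer) allow a non-reversed version. This is a hedged sketch that assumes flat string items without escaped quotes. The lookbehind is variable-length, which JavaScript accepts but PCRE has traditionally rejected, so it is not a drop-in for MariaDB.

```javascript
const str = '{"device_types":["smartphone"],"isps":["a","B"],"network_types":[],"countries":[],"category":["Jebb","Bush","Donald","Trump"],"carriers":[],"exclude_carriers":[]}';

// The lookbehind requires "category":[ before the item with no "]" in between,
// so only quoted strings inside that one array can match.
const rex = /(?<="category":\[[^\]]*)"([^"]*)"/g;

const res = Array.from(str.matchAll(rex), m => m[1]);

console.log(res); // [ 'Jebb', 'Bush', 'Donald', 'Trump' ]
```

Unlike the lookahead version, this does not depend on which key happens to follow "category".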

2 Comments

@GGGforce This solution can't possibly work because it is hard-coded to the example string. If the JSON string changes even slightly, the code will simply not work, if it ever worked at all. I have absolutely NO idea why you are reversing the string and searching for the words and arrays backwards. IMO this answer should be erased. My solution, which was already downvoted the next day, should work. If it does not, please leave a comment with some sample JSON and I can make sure it produces the output you are looking for.
0

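In C++ you can do it like this:

// subject is the std::string holding the JSON text; requires <regex> and <string>.
bool foundmatch = false;
try {
    // A quoted alphabetic key, a colon, then a non-empty array on one line.
    // This tests for any such key (e.g. "isps":["a","B"]), not "category" specifically.
    std::regex re("\"([a-zA-Z]+)\"\\s*:\\s*\\[[^\\]\r\n]+\\]");
    foundmatch = std::regex_search(subject, re);
} catch (const std::regex_error& e) {
    // Syntax error in the regular expression
}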


0
(?<=category":\[).[^\]]*
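
Since no explanation was given, here is a quick check of what this pattern returns, as a JavaScript sketch (the lookbehind is fixed-length, so the same pattern should also be acceptable to PCRE). It matches everything between the brackets in one piece. The leading `.` requires at least one character, so an empty array makes it run past the closing bracket.

```javascript
const pattern = /(?<=category":\[).[^\]]*/;

const json = '{"device_types":["smartphone"],"isps":["a","B"],"network_types":[],"countries":[],"category":["Jebb","Bush"],"carriers":[],"exclude_carriers":[]}';
const whole = json.match(pattern)[0];
console.log(whole); // "Jebb","Bush"

// Splitting the single match into items (assumes no commas or quotes inside values).
const items = whole.split(',').map(s => s.replace(/^"|"$/g, ''));
console.log(items); // [ 'Jebb', 'Bush' ]

// Caveat: with an empty array the "." consumes the "]" and the match overruns.
const empty = '{"category":[],"carriers":[]}';
const overrun = empty.match(pattern)[0];
console.log(overrun); // ],"carriers":[
```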

1 Comment

If you have an explanation to add, please edit instead of commenting. And try to do better than "above answer works for me."
0

This PCRE JSON-parsing regex uses the \G anchor to extract array items from a given key.
That key is "category", whose value is an array.

Only items from this key's array are matched, and of those only strings, numbers,
booleans, or null.

For a more detailed explanation of the recursion functions and other practical examples,
see https://stackoverflow.com/a/79785886/15577665

(?:(?=(?&V_Obj)){(?:(?&V_KeyVal)(?&Sep_Obj))*?\s*"category"\s*:\s*\[\s*|(?!^)\G(?&Sep_Ary)\s*)(?:\s*(?&V_Value)(?&Sep_Ary))*?\K(?:(?&Numb)|(?&Str)|true|false|null)(?(DEFINE)(?<Sep_Ary>\s*(?:,(?!\s*[}\]])|(?=\])))(?<Sep_Obj>\s*(?:,(?!\s*[}\]])|(?=})))(?<Str>(?>"[^\\"]*(?:\\[\s\S][^\\"]*)*"))(?<Numb>(?>[+-]?(?:\d+(?:\.\d*)?|\.\d+)(?:[eE][+-]?\d+)?|(?:[eE][+-]?\d+)))(?<V_KeyVal>(?>\s*(?&Str)\s*:\s*(?&V_Value)\s*))(?<V_Value>(?>(?&Numb)|(?>true|false|null)|(?&Str)|(?&V_Obj)|(?&V_Ary)))(?<V_Ary>\[(?>\s*(?&V_Value)(?&Sep_Ary))*\s*\])(?<V_Obj>{(?>(?&V_KeyVal)(?&Sep_Obj))*\s*}))

https://regex101.com/r/MrMPfW/1

Regex Comments

# PCRE JSON regex - practical application :
# Uses \G anchor to Extract Array Items from a certain Key
# JSON recursion functions by @sln

(?:                                   # ----------
   (?= (?&V_Obj) )                       # Assertion :  Must be a Valid JSON Object
   {
   (?: (?&V_KeyVal) (?&Sep_Obj) )*?      # Drill down to the "category" array
   \s* "category" \s* : \s* \[ \s* 
 |                                      # or ,
   (?! ^ )                               # Here we are in a valid JSON Object or Array
   \G                                    # Continuation anchor, where we last left off
   (?&Sep_Ary) \s*                       # Array Separator
)                                     # ----------

(?: \s* (?&V_Value) (?&Sep_Ary) )*?   # Drill down to the next array item
                                      # Can be anything except an object or an array
\K                                    # Stop recording here, match just the next item
 
(?: (?&Numb) | (?&Str) | true | false | null )

# JSON functions - NoErDet
# ---------------------------------------------
(?(DEFINE)(?<Sep_Ary>\s*(?:,(?!\s*[}\]])|(?=\])))(?<Sep_Obj>\s*(?:,(?!\s*[}\]])|(?=})))(?<Str>(?>"[^\\"]*(?:\\[\s\S][^\\"]*)*"))(?<Numb>(?>[+-]?(?:\d+(?:\.\d*)?|\.\d+)(?:[eE][+-]?\d+)?|(?:[eE][+-]?\d+)))(?<V_KeyVal>(?>\s*(?&Str)\s*:\s*(?&V_Value)\s*))(?<V_Value>(?>(?&Numb)|(?>true|false|null)|(?&Str)|(?&V_Obj)|(?&V_Ary)))(?<V_Ary>\[(?>\s*(?&V_Value)(?&Sep_Ary))*\s*\])(?<V_Obj>{(?>(?&V_KeyVal)(?&Sep_Obj))*\s*}))
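
The "continue where the last match ended" idea behind \G can be tried outside PCRE as well. JavaScript has no \G, so the sketch below swaps in the sticky flag (`y`), which forces each match to start exactly at `lastIndex`. It is a much weaker stand-in for the pattern above: `arrayItems` is a hypothetical helper, it handles only string items, does no JSON validation, returns escape sequences unprocessed, and assumes no whitespace between the key, the colon and the bracket.

```javascript
// Hypothetical helper: collect the string items of the array stored under `key`.
function arrayItems(text, key) {
  const start = text.indexOf(`"${key}":[`);
  if (start === -1) return [];
  // One quoted item followed by a comma or (via lookahead) the closing bracket.
  // Sticky matching plays the role of \G here.
  const item = /\s*"((?:[^"\\]|\\.)*)"\s*(?:,|(?=\]))/y;
  item.lastIndex = start + key.length + 4; // skip past "key":[
  const out = [];
  let m;
  // exec() returns null as soon as the next token is not a string item.
  while ((m = item.exec(text)) !== null) out.push(m[1]);
  return out;
}

const json = '{"device_types":["smartphone"],"isps":["a","B"],"network_types":[],"countries":[],"category":["Jebb","Bush"],"carriers":[],"exclude_carriers":[]}';

console.log(arrayItems(json, 'category')); // [ 'Jebb', 'Bush' ]
console.log(arrayItems(json, 'carriers')); // []
```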


-1

When you have a long string of text to parse, it is probably best practice to deconstruct the string piece by piece instead of trying to use one BIG LONG regular expression. I was able to parse the JSON string piece by piece into the hash %keyValueHash via the following steps.

  1. Remove outer curly braces { ... } from entire line
  2. Separate entire line on ], or ]$ into @keyValuePair
  3. Split each value in @keyValuePair on :
  4. Remove outer double quotes " ... " from $key
  5. Remove outer brackets [ ... ] from $value
  6. Remove all double quotes from $value
  7. Split $value on commas , and store in anonymous list, @value will be a pointer to this list
  8. Find maximum lengths of @keys and @values for table formatting

Here is the code. Each line is a self-contained command rather than one long, mysterious regular expression.

#!/usr/bin/perl -w

my $s = '{"device_types":["smartphone"],"isps":["a","B"],"network_types":[],"countries":[],"category":["Jebb","Bush"],"carriers":[],"exclude_carriers":[]}';
my (@keyValuePair,@keys,@values, %keyValueHash);

#for printf table formatting
my ($largestKey, $largestValue) = (-1,-1);

#remove outer curly braces from entire line, original string preserved
my $copy = $s =~ s/^\{([\w\W]*?)}$/$1/r;

#separate entire line on '],' or ']$'
while( $copy =~ /([\w\W]*?)(\])(,|$)/g ){
  push(@keyValuePair, $1.$2);
}

#separate each @keyValuePair on ':'
for(@keyValuePair){
  my ($key, $value) = split(/:/,$_);

  #remove double quotes from $key
  $key =~ s/^"([\w\W]*?)"/$1/;
  push(@keys, $key);

  #remove outer brackets from $value
  $value =~ s/^\[([\w\W]*?)]$/$1/;

  #remove all double quotes from $value
  $value =~ s/"//g;

  #split $value on ',' and store in anonymous list, @values will contain a pointer to this list
  push(@values, [split(/,/,$value)]);

  #find maximum lengths of $keys and $values for printf table formatting
  $largestKey = length($key) if(length($key) > $largestKey);

  for $v ($values[$#values]){ #the +2 is because values in @values will be surrounded in double quotes
     for(@$v){ $largestValue = length($_)+2 if(length($_)+2 > $largestValue);}
  }
}

#populate %keyValueHash with keys and values
@keyValueHash{@keys} = @values;

#print everything in key "category"
$key = "category";
print "Printing key \"$key\":\n";
printf("%-${largestKey}s : ",$key);
for(@{$keyValueHash{$key}}){ #dereference pointer to @values
  printf("%-${largestValue}s","\"$_\"");
}
print "\n\n";

#print entire hash in printf formatted table
print "Print all keys and values:\n";
for $k (sort keys %keyValueHash){
  printf("%-${largestKey}s : ",$k);
  for(@{$keyValueHash{$k}}){ #dereference pointer to @values
    printf("%-${largestValue}s","\"$_\"")
  }
  print "\n";
}

Output looks like this...

$ perl json.string.pl

Printing key "category":
category         : "Jebb"      "Bush"      

Print all keys and values:
carriers         : 
category         : "Jebb"      "Bush"      
countries        : 
device_types     : "smartphone"
exclude_carriers : 
isps             : "a"         "B"         
network_types    : 

If you can't use a proper JSON parser like JSON::XS, this will probably be your best option. This way will be much easier to maintain if you need to add on or change functionality later. It will also be much better than trying to write something using raw SQL.

You can connect to the MariaDB database using the DBD::MariaDB driver or the DBD::mysql driver. I described how to do this in detail in this answer:

Need a regex_match checks if word doesn't starts with letters R or W and contains 4 letters and 3 numbers

I personally used the DBD::mysql driver. The drivers are supposed to be interchangeable, but the MySQL driver was the only one that installed properly for me. You can try the libdbd-mariadb-perl package, but it gave me an error.

I installed the necessary packages on Ubuntu using the following command

sudo apt install libdbi-perl libdbd-mysql-perl

Once that is installed, connect to the database and run the above code on each JSON string you want to decode, and the data you are looking for will be in %keyValueHash. After the JSON is decoded, you can print reports or run additional inserts, updates, or selects against the database using that dataset.

The above code should also work if the JSON strings contain newlines.
