I have to make an assumption that the "key" and "value" consist of only
"word characters" (\w) and there are no spaces in them.
Here is my program. Please also see the comments in-line:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class RegexJson {
public static void main(String[] args) {
/*
* Note that the input string, when expressed in a Java program, need escape
* for backslash (\) and double quote ("). If you read directly
* from a file then these escapes are not needed
*/
String input = "{abc:\\\"def\\\", ghi:\\\"jkl\\\"}";
// regex for one pair of key-value pair. Eg: abc:\"edf\"
String keyValueRegex = "(?<key>\\w+):(?<value>\\\\\\\"\\w+\\\\\\\")";
// regex for a list of key-value pair, separated by a comma (,) and a space ( )
String pairsRegex = "(?<pairs>(,*\\s*"+keyValueRegex+")+)";
// regex include the open and closing braces ({})
String regex = "\\{"+pairsRegex+"\\}";
StringBuilder sb = new StringBuilder();
sb.append("{");
Pattern p1 = Pattern.compile(regex);
Matcher m1 = p1.matcher(input);
while (m1.find()) {
String pairs = m1.group("pairs");
Pattern p2 = Pattern.compile(keyValueRegex);
Matcher m2 = p2.matcher(pairs);
String comma = ""; // first time special
while (m2.find()) {
String key = m2.group("key");
String value = m2.group("value");
sb.append(String.format(comma + "\\\"%s\\\":%s", key, value));
comma = ", "; // second time and onwards
}
}
sb.append("}");
System.out.println("input is: " + input);
System.out.println(sb.toString());
}
}
The print out of this program is:
input is: {abc:\"def\", ghi:\"jkl\"}
{\"abc\":\"def\", \"ghi\":\"jkl\"}
setLenient()method. Then write it back as valid JSON.:, but that could defeat you if there are colons in any of the value strings. Other things that could defeat you are escaped quote marks inside one of the values. It might be possible to come up with a complex regex that handles everything, but in a case like this it's best to write your own lexer to process tokens in the input (like{,:,,, identifiers, string literals) and work from that. Overly complex regexes are unreadable and prone to error anyway.