The input string is a mix of arbitrary text and valid JSON:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<TITLE>Title</TITLE>

<META http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<META HTTP-EQUIV="Content-language" CONTENT="en">
<META HTTP-EQUIV="keywords" CONTENT="search words">
<META HTTP-EQUIV="Expires" CONTENT="0">

<script SRC="include/datepicker.js" LANGUAGE="JavaScript" TYPE="text/javascript"></script>
<script SRC="include/jsfunctions.js" LANGUAGE="JavaScript" TYPE="text/javascript"></script>

<link REL="stylesheet" TYPE="text/css" HREF="css/datepicker.css">

<script language="javascript" type="text/javascript">
function limitText(limitField, limitCount, limitNum) {
    if (limitField.value.length > limitNum) {
        limitField.value = limitField.value.substring(0, limitNum);
    } else {
        limitCount.value = limitNum - limitField.value.length;
    }
}
</script>
{"List":[{"ID":"175114","Number":"28992"}]}

The task is to deserialize the JSON part into an object. The string can begin with arbitrary text, but it is guaranteed to contain valid JSON. I tried using a JSON-validation regex, but .NET had a problem parsing that pattern.
In the end, I want to get only:

{
    "List": [{
        "ID": "175114",
        "Number": "28992"
    }]
}

Clarification 1:
There is only a single JSON object in the whole messy string, but the surrounding text can contain { and } (it's actually HTML and can contain JavaScript such as <script> function(){..... )

  • Well... you can just use Json.NET Commented Nov 1, 2016 at 13:49
  • @AndyKorneyev I've tried to, but it can't deserialize the string properly. So I must somehow tell Json.NET how to parse it. Commented Nov 1, 2016 at 13:50
  • Why does the text before and after the JSON portion exist? Can the text contain { and }? If not, then the simple solution would be to find the first and last braces and assume that is the start and end of your JSON. Otherwise I'd say you are screwed, since you won't be able to tell where the actual JSON starts. Commented Nov 1, 2016 at 13:59
  • Maybe there are some restrictions on that text? For example, can it contain "{" character? Commented Nov 1, 2016 at 13:59
  • You are screwed, as even {} is valid JSON, and chances are that your HTML-containing text contains things that could be actual JSON out of context. However, you might try to use a DOM parser to extract your JSON from the HTML, if you have a clue where it is placed. You might be even more screwed if your JSON is formatted with HTML :X Commented Nov 1, 2016 at 14:05

2 Answers

You can use this method:

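    // Requires: using Newtonsoft.Json;
    public object ExtractJsonObject(string mixedString)
    {
        // Try every '{' as a possible start of the JSON object.
        for (var i = mixedString.IndexOf('{'); i > -1; i = mixedString.IndexOf('{', i + 1))
        {
            // Try every '}' after it as a possible end, longest span first.
            // Stopping at j > i keeps Substring from receiving a negative length.
            for (var j = mixedString.LastIndexOf('}'); j > i; j = mixedString.LastIndexOf('}', j - 1))
            {
                var jsonProbe = mixedString.Substring(i, j - i + 1);
                try
                {
                    return JsonConvert.DeserializeObject(jsonProbe);
                }
                catch (JsonException)
                {
                    // Not valid JSON; try the next candidate span.
                }
            }
        }
        return null;
    }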

The key idea is to try every pair of { and } and probe whether the text between them is valid JSON. The first span that parses is converted to an object and returned.
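
If the expected shape of the JSON is known, the same probing loop can deserialize straight into a typed object, which also helps reject accidental matches such as an empty {} in the surrounding script. The following is only a sketch under assumptions: the `Item` and `Payload` classes, the `MixedStringParser.ExtractJson<T>` helper and its `isAcceptable` callback are hypothetical names introduced here to mirror the JSON in the question; only `JsonConvert.DeserializeObject<T>` and `JsonException` come from Json.NET.

```csharp
using System;
using System.Collections.Generic;
using Newtonsoft.Json;

// Hypothetical DTOs mirroring the JSON shown in the question.
public class Item
{
    public string ID { get; set; }
    public string Number { get; set; }
}

public class Payload
{
    public List<Item> List { get; set; }
}

public static class MixedStringParser
{
    // Probes every '{' ... '}' span (longest first for each start) and returns
    // the first one that deserializes to T and passes the caller's check.
    public static T ExtractJson<T>(string mixedString, Func<T, bool> isAcceptable)
        where T : class
    {
        for (var i = mixedString.IndexOf('{'); i > -1; i = mixedString.IndexOf('{', i + 1))
        {
            for (var j = mixedString.LastIndexOf('}'); j > i; j = mixedString.LastIndexOf('}', j - 1))
            {
                try
                {
                    var candidate = JsonConvert.DeserializeObject<T>(
                        mixedString.Substring(i, j - i + 1));
                    if (candidate != null && isAcceptable(candidate))
                    {
                        return candidate;
                    }
                }
                catch (JsonException)
                {
                    // Not valid JSON for T; try the next span.
                }
            }
        }
        return null;
    }
}
```

A call might look like `MixedStringParser.ExtractJson<Payload>(html, p => p.List != null && p.List.Count > 0)`. Like the untyped version, this tries a quadratic number of spans in the worst case, so it is best suited to inputs of modest size.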

2 Comments

I was going to do almost the same thing: find the occurrences of '{' by hand and then match some properties. The accepted answer seems more appropriate to me. Thank you for your time and help too!
@0x49D1 - You are welcome. Just an additional thought: it might be better to probe from the longest match to the shortest (I am not sure whether there is a defined order in the regex matches). If there is no fixed order, then a subpart of the JSON might be recognized as valid JSON as well.

Use regex to find all possible JSON structures:

\{(.|\s)*\}

Then iterate over these matches until you find one that deserializes without throwing an exception:

JsonConvert.DeserializeObject(match.Value);

If you know the format of the JSON structure, use JsonSchema.
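
A minimal sketch of this regex-then-probe idea follows, with some assumptions stated up front. The pattern is written as `\{[\s\S]*\}` rather than `\{(.|\s)*\}`; it is meant to match the same spans without the overlapping alternation, which is intended to sidestep the backtracking concern. Because the pattern is greedy, each match runs from a '{' to the last '}' in the string, so the sketch restarts the search one character after each failed match by calling `Regex.Match(string, int)`. The `RegexJsonFinder` class and `FindJson` method are hypothetical names, and the approach assumes that the JSON's closing brace is the last '}' in the input, as it is in the question's example.

```csharp
using System.Text.RegularExpressions;
using Newtonsoft.Json;

public static class RegexJsonFinder
{
    // Greedy: matches from a '{' up to the last '}' in the input.
    private static readonly Regex Candidate = new Regex(@"\{[\s\S]*\}");

    public static object FindJson(string mixedString)
    {
        // On failure, search again starting one character past the previous '{'.
        for (var m = Candidate.Match(mixedString);
             m.Success;
             m = Candidate.Match(mixedString, m.Index + 1))
        {
            try
            {
                return JsonConvert.DeserializeObject(m.Value);
            }
            catch (JsonException)
            {
                // This span is not valid JSON; move on to the next '{'.
            }
        }
        return null;
    }
}
```

If other text containing '}' can follow the JSON, this sketch will not find it, and probing every closing brace as in the substring-based answer is the safer choice.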

4 Comments

Thanks! With a few refinements, that's a better solution than magic with substrings!
This regex may be dangerous because it may cause catastrophic backtracking.
Thanks @Chrile. It doesn't work for complex JSON with a nested structure.
Can someone add a regex that also works for a JSON array (an array of JSON objects)?
