5

I have a string that can be one of two forms:

name multi word description {...}

or

name multi word description [...]

where {...} and [...] are any valid JSON. I am interested in parsing out just the JSON part of the string, but I'm not sure of the best way to do it (especially since I don't know which of the two forms the string will be). This is my current method:

import json

string = 'bob1: The ceo of the company {"salary": 100000}' 
o_ind = string.find('{')
a_ind = string.find('[')

if o_ind == -1 and a_ind == -1:
    print("Could not find JSON")
    exit(0)

index = min(o_ind, a_ind)
if index == -1:
    index = max(o_ind, a_ind)

data = json.loads(string[index:])  # a separate name avoids shadowing the json module
print(data)

It works, but I can't help feeling it could be done better. I considered a regex, but I had trouble with it matching nested objects and arrays instead of the outermost JSON object or array. Any suggestions?

  • I think it is simple and readable, rather than using a complex RegEx. Commented Jan 23, 2016 at 5:29
  • You are importing Json. Just use .parse() Commented Jan 23, 2016 at 5:31

2 Answers

10

You can locate the start of the JSON by looking for a { or [ preceded by whitespace, and then capture everything from there to the end of the string in a capturing group:

>>> import re
>>> string1 = 'bob1: The ceo of the company {"salary": 100000}'
>>> string2 = 'bob1: The ceo of the company ["10001", "10002"]'
>>> 
>>> re.search(r"\s([{\[].*?[}\]])$", string1).group(1)
'{"salary": 100000}'
>>> re.search(r"\s([{\[].*?[}\]])$", string2).group(1)
'["10001", "10002"]'

Here the pattern \s([{\[].*?[}\]])$ breaks down as follows:

  • \s - a single whitespace character
  • ( ... ) - the parentheses form a capturing group
  • [{\[] matches a single { or [ (the latter is escaped with a backslash)
  • .*? is a non-greedy match for any characters (except a newline), any number of times
  • [}\]] matches a single } or ] (the latter needs to be escaped with a backslash)
  • $ means the end of the string
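
To go from the captured text to a Python object, the match can be handed to json.loads. The following is a minimal sketch rather than a definitive implementation: the helper name extract_json_tail is hypothetical, and returning None on a missing match or invalid JSON is an assumption about the desired behaviour.

```python
import json
import re

# Same pattern as above: whitespace, an opening { or [, then everything
# up to a closing } or ] at the end of the string.
JSON_TAIL = re.compile(r"\s([{\[].*?[}\]])$")


def extract_json_tail(s):
    """Hypothetical helper: parse the JSON at the end of s, or return None."""
    match = JSON_TAIL.search(s)
    if match is None:
        return None
    try:
        return json.loads(match.group(1))
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return None


print(extract_json_tail('bob1: The ceo of the company {"salary": 100000}'))
print(extract_json_tail('bob1: The ceo of the company ["10001", "10002"]'))
print(extract_json_tail('bob1'))
```

Because the pattern is anchored with $, the capture always runs to the end of the string; the try/except covers the case where the first bracket after whitespace belongs to the description rather than to the JSON.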

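Or, you may use re.split() to split the string on whitespace followed by a { or [ (using a positive lookahead) and take the last item. It works for the sample input you've provided, but it is not reliable in general: whitespace before a nested { or [ inside the JSON also triggers a split, so the last item would then be only a fragment of the JSON: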

>>> re.split(r"\s(?=[{\[])", string1)[-1]
'{"salary": 100000}'
>>> re.split(r"\s(?=[{\[])", string2)[-1]
'["10001", "10002"]'
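
If nested JSON or stray brackets in the description are a concern, one alternative is to let the JSON parser itself decide where the JSON starts. The sketch below swaps the regex for json.JSONDecoder.raw_decode from the standard library, which is not what the answer above uses. The helper name find_json is hypothetical, and the requirement that the JSON runs to the end of the string is an assumption taken from the two forms in the question.

```python
import json


def find_json(s):
    """Hypothetical helper: return the first JSON value in s that starts
    at a { or [ and runs to the end of the string, or None."""
    decoder = json.JSONDecoder()
    for i, ch in enumerate(s):
        if ch not in "{[":
            continue
        try:
            # raw_decode parses one JSON value starting at index i and
            # returns it together with the index where parsing stopped.
            obj, end = decoder.raw_decode(s, i)
        except ValueError:
            continue  # not valid JSON from here; try the next candidate
        if not s[end:].strip():  # nothing but whitespace may follow
            return obj
    return None


print(find_json('bob1: The ceo of the company {"salary": 100000}'))
print(find_json('bob1: The ceo of the company [{"salary": 100000}]'))
print(find_json('bob [sic] the ceo {"a": [1]}'))
```

The trade-off is that each candidate bracket triggers a parse attempt, which is acceptable for short strings like these but does more work than a single regex search.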

4

You could use a simple alternation (|) in the regex to match either form:

import re
import json

def json_from_s(s):
    match = re.findall(r"{.+[:,].+}|\[.+[,:].+\]", s)
    return json.loads(match[0]) if match else None

And some tests:

print(json_from_s('bob1: The ceo of the company {"salary": 100000}'))
print(json_from_s('bob1: The ceo of the company ["salary", 100000]'))
print(json_from_s('bob1'))
print(json_from_s('{1:}'))
print(json_from_s('[,1]'))

Output (Python 3):

{'salary': 100000}
['salary', 100000]
None
None
None

8 Comments

Consider this case: 'bob1: The ceo of the company [{"salary": 100000}]'. The regex only matches the inner json object and not the outer json array
I only followed the OP's question and explanation.
I am the OP, and the explanation I gave is that the string can be of the form name multi word description [...]. The case I gave you above follows that pattern, but the regex fails to capture it.
It doesn't fail to catch [...], as you can see from the tests. The case you provided in the comment above won't be caught by the accepted answer either, because you didn't specify in your question that JSON might be nested inside the list.
If you just want to catch a JSON object in the string but not a list, remove the "or" part of the regex.