I have a JSON structure like this:

{
    "a": "1",
    "b": "2",
    "c": {
        "d": "3"
    }
}

I want to keep only the first level of the JSON, i.e. drop any top-level entry whose value is not a string. I have a program like this:

import json

s = ''' {
    "a": "1",
    "b": "2",
    "c": {
        "d": "3"
    } } '''

data = json.loads(s)
ret = {}

# keep only the top-level entries whose value is a string (Python 2)
for k, v in data.items():
    if isinstance(v, basestring):
        ret[k] = v

print json.dumps(ret)

Since I need to process a huge number of JSON strings like this, I am looking for a faster or more elegant way to do the same thing in Python.

  • Be careful when you embed a JSON string verbatim in a Python string literal. Use a raw string literal (r'...') so that Python does not interpret the backslashes inside the JSON. Commented May 5, 2014 at 17:26
  • If the question is about performance, you should provide a basic benchmark and determine how fast is fast enough in your case. Commented May 5, 2014 at 17:30
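To illustrate the raw-string comment above, here is a minimal sketch. It assumes Python 3, and the sample document with an escaped quote is made up for the demonstration; it is not the asker's data.

```python
import json

# A JSON document whose string value contains an escaped double quote.
# The r prefix keeps the backslashes intact for the JSON parser.
raw = r'{"a": "say \"hi\"", "b": {"c": "1"}}'
data = json.loads(raw)
print(data["a"])  # say "hi"

# Without the r prefix, Python consumes the backslashes itself, so the
# JSON parser sees unescaped quotes and rejects the document.
plain = '{"a": "say \"hi\"", "b": {"c": "1"}}'
try:
    json.loads(plain)
    rejected = False
except ValueError as exc:  # json.JSONDecodeError subclasses ValueError
    rejected = True
    print("invalid JSON:", exc)
```

The question's sample has no backslashes, so it parses either way; the difference only shows up once the JSON contains escapes such as \" or \n.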

1 Answer

Use a dict comprehension:

ret = {k: v for k, v in json.loads(s).iteritems() if isinstance(v, basestring)}

In Python 2, dict.iteritems() yields the pairs lazily, so no intermediate list of all items is built first, as dict.items() would do.
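For readers on Python 3, a rough equivalent might look like the sketch below. It assumes Python 3, where basestring no longer exists (str takes its place) and dict.items() already returns a lazy view, so iteritems() is not needed. The function name top_level_strings is only illustrative.

```python
import json


def top_level_strings(text):
    """Return only the top-level key/value pairs whose value is a string."""
    return {k: v for k, v in json.loads(text).items() if isinstance(v, str)}


s = r'''{
    "a": "1",
    "b": "2",
    "c": {
        "d": "3"
    }
}'''

print(json.dumps(top_level_strings(s)))  # {"a": "1", "b": "2"}
```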

If your JSON input is truly huge, consider switching to an iterative JSON parser like ijson, and parse your JSON with an event-driven interface:

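import ijson

ret = {}

# some_large_jsonfile is a placeholder for the path to your input file;
# ijson expects the file to be opened in binary mode
with open(some_large_jsonfile, 'rb') as json_file:
    for prefix, event, value in ijson.parse(json_file):
        # a prefix without a dot is a top-level key; keep only string values
        if prefix and '.' not in prefix and event == 'string':
            ret[prefix] = value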

That said, it could be a good idea to process each key-value pair as it arrives rather than build up a full dictionary.

3 Comments

My JSON is not huge, but I have many lines of JSON to process.
@Ryan: a dict comprehension might be slower than an explicit loop (here's an example where a generator expression, a related concept, is slower than an explicit for-loop). If an individual JSON object is small, it is not clear which would be faster: a loop that uses .iteritems() or one that uses .items() (all items have to be created anyway; same logic as xrange() vs range()). Without a benchmark it is hard to say. unicode could be used instead of basestring. Do you mean that you have many small JSON objects (one per line), e.g. like a tweet stream?
@J.F.Sebastian, my JSON is around 4K in size on average. Anyway, I will do the benchmark first. Thanks.
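The benchmark suggested in the comments could be set up along these lines. This is only a sketch under stated assumptions: it targets Python 3, the names filter_loop and filter_comprehension are hypothetical, and the sample document is synthetic (a few kilobytes, half strings and half nested objects) rather than real input. Timings vary by machine and interpreter, so no expected numbers are given.

```python
import json
import timeit


def filter_loop(text):
    """Explicit for-loop version from the question."""
    ret = {}
    for k, v in json.loads(text).items():
        if isinstance(v, str):
            ret[k] = v
    return ret


def filter_comprehension(text):
    """Dict comprehension version from the answer."""
    return {k: v for k, v in json.loads(text).items() if isinstance(v, str)}


# Synthetic input: odd-numbered keys hold strings, even-numbered keys hold objects.
sample = json.dumps({
    "key%d" % i: ("value%d" % i if i % 2 else {"nested": str(i)})
    for i in range(150)
})

# Both variants must agree before their timings are worth comparing.
assert filter_loop(sample) == filter_comprehension(sample)

for func in (filter_loop, filter_comprehension):
    seconds = timeit.timeit(lambda: func(sample), number=2000)
    print("%s: %.3f s for 2000 calls (%d-byte input)"
          % (func.__name__, seconds, len(sample)))
```

Both variants call json.loads(), so if the two timings come out close, parsing is likely the dominant cost and the choice between loop and comprehension matters little.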
