I have code that takes a nested object and removes all nesting (flattens the object):

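import json

import pandas as pd
import requests

def flatten_json(y):
    """
    @param y: nested JSON object (dict)
    @return: flat dict whose keys are the nested key paths joined with '_'
    """
    out = {}

    def flatten(x, name=''):
        if isinstance(x, dict):
            # Recurse into nested dicts, extending the key prefix.
            for a in x:
                flatten(x[a], name + a + '_')
        else:
            # Lists and scalar values are stored as-is; drop the trailing '_'.
            out[name[:-1]] = x

    flatten(y)
    return out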

def generatejson(response):
    sample_object = pd.DataFrame(response.json())['results'].to_dict()
    flat = {k: flatten_json(v) for k, v in sample_object.items()}
    return json.dumps(flat, sort_keys=True)

respons = requests.get(urlApi, data=data, headers=hed, verify=False)
flat1 = generatejson(respons)

....
storage.Bucket(BUCKET_NAME).item(path).write_to(flat1, 'application/json')

This does the following:

  1. Call the API
  2. Remove nested objects
  3. Generate JSON
  4. Upload the JSON to Google Storage

This works great. The problem is that BigQuery does not accept regular JSON, so I need to convert it to newline-delimited JSON before the upload.

Is there a way to change return json.dumps(flat, sort_keys=True) so that it returns newline-delimited JSON instead of regular JSON?

Sample of my JSON:

{"0": {"code": "en-GB", "id": 77, "languageName": "English", "name": "English"}, 
"1": {"code": "de-DE", "id": 78, "languageName": "Deutsch", "name": "German"}}

Edit:

The expected newline-delimited JSON result is:

{"languageName":"English","code":"en-GB","id":2,"name":"English"}
{"languageName":"Deutsch","code":"de-DE","id":5,"name":"German"}

For example, if I take the API response and do:

df['results'].to_json(orient="records",lines=True)

This gives the desired output, but I can't do that with json.dumps(flat, sort_keys=True) because no DataFrame is involved there.

5
  • By "newline Json standard format", do you mean jsonlines.org? It's strange that BigQuery is rejecting regular json, because as far as I can tell, regular json is also syntactically correct JSON Lines as long as it's all on one line. Commented Jul 30, 2018 at 13:30
  • @Kevin cloud.google.com/bigquery/docs/loading-data-cloud-storage-json "JSON data must be newline delimited" Commented Jul 30, 2018 at 13:31
  • Right, and if you only have one element, then it doesn't matter what delimiter you use, because delimiters are only necessary to delimit multiple elements. By analogy, consider that Python lists are delimited by commas, but [1] is still a valid list, despite not containing any commas. Commented Jul 30, 2018 at 13:34
  • So maybe try json.dumps(flat, sort_keys=True).replace('\n', ''). You might need to add back a newline on the end. Commented Jul 30, 2018 at 13:35
  • Doesn't work. It expects the data to be: {"languageName":"English","code":"en-GB","id":2,"name":"English"} {"languageName":"Deutsch","code":"de-DE","id":5,"name":"German"} For example, if you take the sample JSON from the question and run df['results'].to_json(orient="records",lines=True) on it (a pandas DataFrame), this is the output. Commented Jul 30, 2018 at 13:41

2 Answers

1

I think you're looking for something like this?

import json

def create_jsonlines(original):

    if isinstance(original, str):
        original = json.loads(original)

    return '\n'.join([json.dumps(original[outer_key], sort_keys=True) 
                      for outer_key in sorted(original.keys(),
                                              key=lambda x: int(x))])

# Added fake record to prove order is sorted
inp = {
   "3": {"code": "en-FR", "id": 76, "name": "French", "languageName": "French"},
   "0": {"code": "en-GB", "id": 77, "languageName": "English", "name": "English"}, 
   "1": {"code": "de-DE", "id": 78, "languageName": "Deutsch", "name": "German"}
   }
output = create_jsonlines(inp)

print(output)
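If the goal is to return newline-delimited JSON straight from generatejson, the same idea can be folded into a small helper. This is only a sketch under a few assumptions: to_ndjson is a hypothetical name, the outer keys are assumed to be integers or numeric strings (as in the sample from the question), and the compact separators are chosen to match the spacing of the expected output shown there.

```python
import json


def to_ndjson(flat):
    """Serialize a dict of flat records as newline-delimited JSON.

    `flat` is assumed to look like {"0": {...}, "1": {...}} or {0: {...}, 1: {...}};
    the outer keys are only used for ordering and are dropped from the output.
    """
    # Sort numerically so that "10" comes after "2", not before it.
    ordered_keys = sorted(flat, key=int)
    lines = [
        # sort_keys=True keeps the attributes of every record in a stable order;
        # the separators remove the spaces json.dumps inserts by default.
        json.dumps(flat[key], sort_keys=True, separators=(",", ":"))
        for key in ordered_keys
    ]
    # One JSON object per line, each terminated by a newline.
    return "".join(line + "\n" for line in lines)


sample = {
    "1": {"code": "de-DE", "id": 78, "languageName": "Deutsch", "name": "German"},
    "0": {"code": "en-GB", "id": 77, "languageName": "English", "name": "English"},
}
ndjson = to_ndjson(sample)
print(ndjson, end="")
```

With a helper like this, the last line of generatejson could become return to_ndjson(flat) in place of json.dumps(flat, sort_keys=True), and the result could be passed to write_to unchanged.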

4 Comments

I changed it to storage.Bucket(BUCKET_NAME).item(path).write_to(create_jsonlines(flat1), 'application/json'). It doesn't work: AttributeError: 'str' object has no attribute 'items'
@jack try the updated function. If that works, I can then fix the ordering but it's pointless if it doesn't do what you need.
@jack fixed. It's both sorted on the outer keys, and the strings in the output are sorted by the inner key names.
@jack please use the updated version. I had an issue with the sort because your outer keys are string, so we need to convert them to int() for the purposes of sorting.
0

Take a look at jsonlines on GitHub and install it from PyPI with pip install jsonlines. From the documentation:

jsonlines is a Python library to simplify working with jsonlines and ndjson data.

This data format is straight-forward: it is simply one valid JSON value per line, encoded using UTF-8. While code to consume and create such data is not that complex, it quickly becomes non-trivial enough to warrant a dedicated library when adding data validation, error handling, support for both binary and text streams, and so on. This small library implements all that (and more!) so that applications using this format do not have to reinvent the wheel.
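To make the format concrete, here is a small sketch that uses only the standard library rather than jsonlines itself, so none of the calls below are that library's API. write_lines and read_lines are hypothetical helper names, the records are made up from the sample in the question, and io.StringIO stands in for a real file.

```python
import io
import json


def write_lines(records, stream):
    """Write each record as one JSON value per line."""
    for record in records:
        stream.write(json.dumps(record, sort_keys=True) + "\n")


def read_lines(stream):
    """Parse a JSON Lines stream back into a list of Python values."""
    # Blank lines are skipped; a malformed line raises json.JSONDecodeError.
    return [json.loads(line) for line in stream if line.strip()]


records = [
    {"name": "English", "code": "en-GB", "id": 77},
    {"name": "German", "code": "de-DE", "id": 78},
]

buffer = io.StringIO()
write_lines(records, buffer)
text = buffer.getvalue()

# Rewind and read the same data back to check the round trip.
buffer.seek(0)
round_tripped = read_lines(buffer)
print(text, end="")
```

A dedicated library mainly adds validation and error handling on top of this basic loop.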

4 Comments

This doesn't solve my problem. jsonlines has no option to convert JSON to newline-delimited JSON, nor does it solve my problem with json.dump(). Please note the sort_keys=True; this must stay.
@jack the sort_keys part is not for an individual line, just that the order of the individual lines must keep that sorted order?
My JSON has 900+ attributes, so I will get lost without ordering. But at this point I'm willing to do whatever it takes just to make it work; then I will handle the order.
the jsonlines lib can sort keys just fine: jsonlines.readthedocs.io/en/latest/#jsonlines.Writer
