One of BigQuery's limitations for loading data from JSON is:
JSON data must be newline delimited
I have this code:
def create_jsonlines(self, original):
    if isinstance(original, str):
        original = json.loads(original)
    return '\n'.join([json.dumps(item) for _, item in original.items()])
This writes regular compressed JSON to Google Storage:
regular = prefix + '/regular.json.gz'
storage.Bucket('bucket').item(regular).write_to(gzip.compress(bytes(data, encoding='utf8')), 'application/json')
This writes newline-delimited compressed JSON to Google Storage:
newline = prefix + '/newline.json.gz'
storage.Bucket('bucket').item(newline).write_to(gzip.compress(bytes(self.create_jsonlines(data), encoding='utf8')), 'application/json')
The regular JSON is OK: it contains everything it should. But I can't really use it, because this format is not supported by BigQuery.
The newline JSON is not OK: lots of data is missing, so clearly I'm converting it wrong.
data is a dump produced as follows: data = json.dumps(result, sort_keys=True)
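For example, with a small hypothetical payload shaped like mine (the ids and names below are made up), the unpacking for _, item in original.items() discards the top-level keys:

import json

sample = json.dumps({"id1": {"name": "a"}, "id2": {"name": "b"}}, sort_keys=True)
original = json.loads(sample)
# The underscore throws away each key, so only the values survive:
print('\n'.join([json.dumps(item) for _, item in original.items()]))
# {"name": "a"}
# {"name": "b"}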
How can I fix the create_jsonlines function?
json.dump(s) takes the indent argument. If set to 0 or negative, it will insert newlines.
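For instance (a minimal sketch; the dict below is made up):

import json

# indent=0 pretty-prints with a newline after every element,
# but without any leading indentation spaces:
print(json.dumps({"a": 1, "b": 2}, indent=0))
# {
# "a": 1,
# "b": 2
# }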