I have a bunch of JSON data that I wrote mostly by hand (several thousand lines), and I need to refactor it into a totally different format using Python.
An overview of my 'stuff':
Column: The basic 'unit' of my data. Each Column has attributes. Don't worry about the meaning of the attributes, but the attributes need to be retained for each Column if they exist.
Folder: Folders group Columns and other Folders together. Folders currently have no attributes of their own; they only contain other Folder and Column objects ("object" here does not necessarily mean a JSON object, more of an 'entity').
Universe: Universes group everything into big chunks which, in the larger scope of my project, are unable to interact with each other. That is not important here, but that's what they do.
Some limitations:
- Columns cannot contain other Column objects, Folder objects, or Universe objects.
- Folders cannot contain Universe objects.
- Universes cannot contain other Universe objects.
Currently, I have Columns in this form:
"Column0Name": {
"type": "a type",
"dtype": "data type",
"description": "abcdefg"
}
and I need it to go to:
{
    "name": "Column0Name",
    "type": "a type",
    "dtype": "data type",
    "description": "abcdefg"
}
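For a single Column, I think the conversion boils down to something like this (a throwaway sketch, not my real code; the function name is just for illustration):

def convert_column(name, attrs):
    # fold the Column's key (its name) into the attribute dict itself
    out = {"name": name}
    out.update(attrs)  # keep whatever attributes the Column happens to have
    return out

# convert_column("Column0Name", {"type": "a type", "dtype": "data type", "description": "abcdefg"})
# -> {"name": "Column0Name", "type": "a type", "dtype": "data type", "description": "abcdefg"}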
Essentially I need to convert the Column key-value pairs into an array of those objects (I am new to JSON, so I may not have the terminology right). I also need each Folder to end up with two new JSON arrays in addition to the "name": "FolderName" key-value pair: a "folders": [] and a "columns": []. So I have this for Folders:
"Folder0Name": {
"Column0Name": {
"type": "a",
"dtype": "b",
"description": "c"
},
"Column1Name": {
"type": "d",
"dtype": "e",
"description": "f"
}
}
and it needs to go to this:
{
    "name": "Folder0Name",
    "folders": [],
    "columns": [
        {"name": "Column0Name", "type": "a", "dtype": "b", "description": "c"},
        {"name": "Column1Name", "type": "d", "dtype": "e", "description": "f"}
    ]
}
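As far as I can tell, the only way to know whether a nested dict is a Column or a Folder is to look at its keys. Something like this is what I have in mind (a sketch; 'aggregation' is another attribute some of my Columns have, and in my real data anything that isn't a dict is an error):

COL_ATTRS = {'type', 'dtype', 'description', 'aggregation'}

def looks_like_column(value):
    # a dict with at least one Column attribute is a Column;
    # an empty dict, or a dict of further nested dicts, is a Folder
    return isinstance(value, dict) and bool(COL_ATTRS & value.keys())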
The Folders will also end up in an array inside their parent Universe. Likewise, each Universe will end up with "name", "folders", and "columns" entries, like so:
{
    "name": "Universe0",
    "folders": [a bunch of folders in a JSON array],
    "columns": [occasionally some columns in a JSON array]
}
Bottom line:
- I'm going to guess that I need a recursive function to iterate through all the nested dictionaries after I import the JSON data with the json Python module.
- I'm thinking some sort of usage of yield might help, but I'm not super familiar with it yet.
- Would it be easier to update the dicts as I go, or to destroy each key-value pair and construct an entirely new dict as I go? (There's a rough sketch of what I mean right after this list.)
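To make that last bullet concrete, this is roughly the "build an entirely new dict" approach I have in mind (a completely untested sketch; it repeats the convert_column helper and the COL_ATTRS test from above so the snippet runs on its own, and it reuses the dummy_json_data.json file name from my script below):

import json

COL_ATTRS = {'type', 'dtype', 'description', 'aggregation'}

def convert_column(name, attrs):
    # fold the old key into the dict as "name", keep the attributes as-is
    out = {'name': name}
    out.update(attrs)
    return out

def convert_folder(name, contents):
    # Folders and Universes end up with the same shape: name + folders + columns
    node = {'name': name, 'folders': [], 'columns': []}
    for key, value in contents.items():
        if isinstance(value, dict) and COL_ATTRS & value.keys():
            node['columns'].append(convert_column(key, value))
        else:
            node['folders'].append(convert_folder(key, value))  # recurse into sub-Folders
    return node

with open('dummy_json_data.json') as fin:
    data = json.load(fin)

# each top-level key is a Universe, and a Universe converts just like a Folder
universes = [convert_folder(name, contents) for name, contents in data.items()]
print(json.dumps(universes, indent=2))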
Here is what I have so far. I'm stuck on getting the generator to return actual dictionaries instead of a generator object.
import json


class AllUniverses:
    """Container to hold all the Universes found in the json file"""
    def __init__(self, filename):
        self._fn = filename
        self.data = {}
        self.read_data()

    def read_data(self):
        with open(self._fn, 'r') as fin:
            self.data = json.load(fin)
        return self

    def universe_key(self):
        """Get the next universe key from the dict of all universes

        The key will be used as the name for the universe.
        """
        yield from self.data


class Universe:
    def __init__(self, json_filename):
        self._au = AllUniverses(filename=json_filename)
        self.uni_key = self._au.universe_key()

        self._universe_data = self._au.data.copy()
        self._col_attrs = ['type', 'dtype', 'description', 'aggregation']
        self._folders_list = []
        self._columns_list = []
        self._type = "Universe"
        self._name = ""
        self.uni = dict()
        self.is_folder = False
        self.is_column = False

    def output(self):
        # TODO: Pass this to json.dump?
        # TODO: Still need to get the actual folder and column dictionaries
        #       from the generators
        out = {
            "name": self._name,
            "type": "Universe",
            "folder": [f.me for f in self._folders_list],
            "columns": [c.me for c in self._columns_list]}
        return out

    def update_universe(self):
        """Get the next universe"""
        universe_k = next(self.uni_key)
        self._name = str(universe_k)
        self.uni = self._universe_data.pop(universe_k)
        return self

    def parse_nodes(self):
        """Process all child nodes"""
        nodes = [_ for _ in self.uni.keys()]
        for k in nodes:
            v = self.uni.pop(k)
            self._is_column(val=v)
            if self.is_column:
                fc = Column(data=v, key_name=k)
                self._columns_list.append(fc)
            else:
                fc = Folder(data=v, key_name=k)
                self._folders_list.append(fc)
        return self

    def _is_column(self, val):
        """Determine if val is a Column or Folder object"""
        self.is_folder = False
        self._column = False

        if isinstance(val, dict) and not val:
            self.is_folder = True
        elif not isinstance(val, dict):
            raise TypeError('Cannot handle inputs not of type dict')
        elif any([i in val.keys() for i in self._col_attrs]):
            self._column = True
        else:
            self.is_folder = True
        return self

    def parse_children(self):
        for folder in self._folders_list:
            assert(isinstance(folder, Folder)), f'bletch idk what happened'
            folder.parse_nodes()


class Folder:
    def __init__(self, data, key_name):
        self._data = data.copy()
        self._name = str(key_name)
        self._node_keys = [_ for _ in self._data.keys()]
        self._folders = []
        self._columns = []
        self._col_attrs = ['type', 'dtype', 'description', 'aggregation']

    @property
    def me(self):
        # maybe this should force the code to parse all children of this
        # Folder? Need to convert the generator into actual dictionaries
        return {"name": self._name, "type": "Folder",
                "columns": [(c.me for c in self._columns)],
                "folders": [(f.me for f in self._folders)]}

    def parse_nodes(self):
        """Parse all the children of this Folder

        Parse through all the node names. If it is detected to be a Folder
        then create a Folder obj. from it and add to the list of Folder
        objects. Else create a Column obj. from it and append to the list
        of Column obj.

        This should be appending dictionaries
        """
        for key in self._node_keys:
            _folder = False
            _column = False
            values = self._data.copy()[key]

            if isinstance(values, dict) and not values:
                _folder = True
            elif not isinstance(values, dict):
                raise TypeError('Cannot handle inputs not of type dict')
            elif any([i in values.keys() for i in self._col_attrs]):
                _column = True
            else:
                _folder = True

            if _folder:
                f = Folder(data=values, key_name=key)
                self._folders.append(f.me)
            else:
                c = Column(data=values, key_name=key)
                self._columns.append(c.me)
        return self


class Column:
    def __init__(self, data, key_name):
        self._data = data.copy()
        self._stupid_check()
        self._me = {
            'name': str(key_name),
            'type': 'Column',
            'ctype': self._data.pop('type'),
            'dtype': self._data.pop('dtype'),
            'description': self._data.pop('description'),
            'aggregation': self._data.pop('aggregation')}

    def __str__(self):
        # TODO: pretty sure this isn't correct
        return str(self.me)

    @property
    def me(self):
        return self._me

    def to_json(self):
        # This seems to be working? I think?
        return json.dumps(self, default=lambda o: str(self.me))  # o.__dict__)

    def _stupid_check(self):
        """If the key isn't in the dictionary, add it"""
        keys = [_ for _ in self._data.keys()]
        keys_defining_a_column = ['type', 'dtype', 'description', 'aggregation']
        for json_key in keys_defining_a_column:
            if json_key not in keys:
                self._data[json_key] = ""
        return self


if __name__ == "__main__":
    file = r"dummy_json_data.json"
    u = Universe(json_filename=file)
    u.update_universe()
    u.parse_nodes()
    u.parse_children()
    print('check me')
And it gives me this:
{
    "name":"UniverseName",
    "type":"Universe",
    "folder":[
        {"name":"Folder0Name",
         "type":"Folder",
         "columns":[<generator object Folder.me.<locals>.<genexpr> at 0x000001ACFBEDB0B0>],
         "folders":[<generator object Folder.me.<locals>.<genexpr> at 0x000001ACFBEDB190>]
        },
        {"name":"Folder2Name",
         "type":"Folder",
         "columns":[<generator object Folder.me.<locals>.<genexpr> at 0x000001ACFBEDB040>],
         "folders":[<generator object Folder.me.<locals>.<genexpr> at 0x000001ACFBEDB120>]
        },
        {"name":"Folder4Name",
         "type":"Folder",
         "columns":[<generator object Folder.me.<locals>.<genexpr> at 0x000001ACFBEDB270>],
         "folders":[<generator object Folder.me.<locals>.<genexpr> at 0x000001ACFBEDB200>]
        },
        {"name":"Folder6Name",
         "type":"Folder",
         "columns":[<generator object Folder.me.<locals>.<genexpr> at 0x000001ACFBEDB2E0>],
         "folders":[<generator object Folder.me.<locals>.<genexpr> at 0x000001ACFBEDB350>]
        },
        {"name":"Folder8Name",
         "type":"Folder",
         "columns":[<generator object Folder.me.<locals>.<genexpr> at 0x000001ACFBEDB3C0>],
         "folders":[<generator object Folder.me.<locals>.<genexpr> at 0x000001ACFBEDB430>]
        }
    ],
    "columns":[]
}
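For what it's worth, I suspect those <generator object ...> entries come from the extra parentheses in Folder.me: "columns": [(c.me for c in self._columns)] builds a list containing a single generator expression rather than a list of dicts. A minimal snippet (made-up data) showing the difference I mean:

cols = [{'name': 'Column0Name'}, {'name': 'Column1Name'}]

print([(c for c in cols)])  # [<generator object <genexpr> at 0x...>]  <- what I'm getting
print([c for c in cols])    # [{'name': 'Column0Name'}, {'name': 'Column1Name'}]  <- what I want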
If there is an existing tool for this kind of transformation so that I don't have to write Python code, that would be an attractive alternative, too.