I have to work with a file format where each row is a JSON object. For example:
{"Attribute 1": "A", "Attribute 2": 1.5, "Attribute 3": ["A","B","C"], "Attribute 4": {"A": 5}}
{"Attribute 1": "B", "Attribute 2": 2.0, "Attribute 3": ["A"], "Attribute 4": {"A": 4}}
{"Attribute 1": "C", "Attribute 2": 1.7, "Attribute 3": ["A","C"], "Attribute 4": {"A": 3}}
Note that the file as a whole is not valid JSON, since the objects are not enclosed in an array; each line is a standalone JSON object. The actual structures are also far larger and more deeply nested. These files are stored in S3. I've only worked with Parquet and CSV before, so I'm not sure how to read these files.
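To be concrete about the format, here is a quick sanity check in plain Python with the sample rows inlined (no S3 involved): the whole payload fails to parse as one document, but every line parses on its own.

```python
import json

# Three rows copied from the sample above, one JSON object per line.
sample = "\n".join([
    '{"Attribute 1": "A", "Attribute 2": 1.5, "Attribute 3": ["A","B","C"], "Attribute 4": {"A": 5}}',
    '{"Attribute 1": "B", "Attribute 2": 2.0, "Attribute 3": ["A"], "Attribute 4": {"A": 4}}',
    '{"Attribute 1": "C", "Attribute 2": 1.7, "Attribute 3": ["A","C"], "Attribute 4": {"A": 3}}',
])

# Parsing the whole payload as a single JSON document fails ("Extra data")...
try:
    json.loads(sample)
    whole_ok = True
except json.JSONDecodeError:
    whole_ok = False

# ...but each line parses fine on its own.
records = [json.loads(line) for line in sample.splitlines()]
print(whole_ok, records[0]["Attribute 4"]["A"])  # prints: False 5
```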
I'm currently writing a process to join this data with several other tables, and since the data is large and lives in S3, I'm using pyspark.sql on an EMR cluster for the operations. I can create a table with a single column containing the objects as strings using:
from pyspark.sql import SQLContext
from pyspark.sql.types import StructType, StructField, StringType

sqlContext = SQLContext(sc)

schema = StructType([
    StructField('json_format', StringType())
])

df = sqlContext.read \
    .schema(schema) \
    .load(
        folder_path,
        format='com.databricks.spark.csv',
        delimiter=','
    )
df.createOrReplaceTempView('my_table')
How can I transform this column into a dictionary-like structure so I can access the various attributes? Is there an equivalent of applying a lambda function to each row?
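To clarify what I mean by a lambda: outside Spark I would just map json.loads over the row strings and index into the resulting dicts, something like the sketch below (row strings copied from the sample above) — I'm looking for the equivalent of this on the string column.

```python
import json

rows = [
    '{"Attribute 1": "B", "Attribute 2": 2.0, "Attribute 3": ["A"], "Attribute 4": {"A": 4}}',
    '{"Attribute 1": "C", "Attribute 2": 1.7, "Attribute 3": ["A","C"], "Attribute 4": {"A": 3}}',
]

# Lambda-style per-row transform: JSON string -> dict, then attribute access,
# including nested fields.
parsed = list(map(lambda s: json.loads(s), rows))
values = [(r["Attribute 1"], r["Attribute 4"]["A"]) for r in parsed]
# values == [("B", 4), ("C", 3)]
```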