Using PySpark, I am reading multiple files, each containing one JSON object, from a folder contentdata2:
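# Read every JSON file in the folder; DROPMALFORMED skips records that fail to parse
df = spark.read \
    .option("mode", "DROPMALFORMED") \
    .json("./data/contentdata2/")
df.printSchema()
# Bring the fields column back to the driver as a list of Row objects
content = df.select('fields').collect()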
where df.printSchema() yields
root
 |-- fields: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- field: string (nullable = true)
 |    |    |-- type: string (nullable = true)
 |    |    |-- value: string (nullable = true)
 |-- id: string (nullable = true)
 |-- score: double (nullable = true)
 |-- siteId: string (nullable = true)
For each JSON object, I want to go through fields.element and store the value of the element whose field equals body and the value of the element whose field equals urlhash.
content is a list of Row objects, each with a fields attribute that contains further Rows, like this:
[Row(fields=[Row(field='body', type=None, value='["First line of text","Second line of text"]'), Row(field='urlhash', type=None, value='0a0b774c21c68325aa02cae517821e78687b2780')]), Row(fields=[Row(field='body', type=None, value='["First line of text","Second line of text"]'), Row(field='urlhash', type=None, value='0a0b774c21c6caca977e7821e78687b2780')]), ...
The "Row(fields=[Row(field=...." pattern repeats because the JSON objects from the different files are merged into one list. There are also many other Row elements that I am not interested in, so I left them out of the example.
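To make the target concrete, here is a minimal sketch of pulling the two values out of a list shaped like content. It is not a Spark solution: the namedtuple types Field and Record are stand-ins for pyspark Row objects so that the snippet runs without a Spark session, and the helper name extract_pair is hypothetical. It assumes the body value arrives as a JSON-encoded string, as the output above suggests.

```python
import json
from collections import namedtuple

# Stand-ins for pyspark.sql.Row (an assumption, so this runs without Spark);
# real Row objects allow the same attribute access: row.fields, f.field, f.value.
Field = namedtuple("Field", ["field", "type", "value"])
Record = namedtuple("Record", ["fields"])


def extract_pair(record):
    """Return (urlhash, body_lines) for one collected row (hypothetical helper)."""
    # Index the inner rows by their field name, ignoring everything else later.
    by_name = {f.field: f.value for f in record.fields}
    # The schema reports value as a string, so the body list is JSON-encoded text.
    body_lines = json.loads(by_name["body"]) if "body" in by_name else []
    return by_name.get("urlhash"), body_lines


sample = [
    Record(fields=[
        Field("body", None, '["First line of text","Second line of text"]'),
        Field("urlhash", None, "0a0b774c21c68325aa02cae517821e78687b2780"),
    ]),
]

pairs = [extract_pair(record) for record in sample]
```

With the real data, the same loop would run over content instead of sample, which only works while the collected list fits in driver memory.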
The structure of the JSON objects looks like this:
{
  "fields": [
    {
      "field": "body",
      "value": [
        "Some text",
        "Another line of text",
        "Third line of text."
      ]
    },
    {
      "field": "urlhash",
      "value": "0a0a341e189cf2c002cb83b2dc529fbc454f97cc"
    }
  ],
  "score": 0.87475455,
  "siteId": "9222270286501375973",
  "id": "0a0a341e189cf2c002cb83b2dc529fbc454f97cc"
}
I want to store all words from the body of each URL, so that I can later remove stopwords and feed the words into a k-nearest-neighbour algorithm.
How can I store the words from the body for each URL, preferably as a TSV or CSV file with columns urlhash and words (where words is the list of words from body)?
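The output I have in mind could look roughly like the sketch below, written in plain Python with the standard csv module rather than Spark. The function names body_words and write_tsv, the regex tokenizer, and the choice to store the word list as a JSON string in the words column are all assumptions made for illustration.

```python
import csv
import io
import json
import re


def body_words(body_lines):
    """Lower-case the body lines and split them into words (simple regex tokenizer)."""
    words = []
    for line in body_lines:
        words.extend(re.findall(r"[a-z0-9']+", line.lower()))
    return words


def write_tsv(pairs, out):
    """Write (urlhash, body_lines) pairs as TSV with columns urlhash and words."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["urlhash", "words"])
    for urlhash, body_lines in pairs:
        # JSON-encode the word list so it can be restored later with json.loads.
        writer.writerow([urlhash, json.dumps(body_words(body_lines))])


pairs = [
    (
        "0a0a341e189cf2c002cb83b2dc529fbc454f97cc",
        ["Some text", "Another line of text", "Third line of text."],
    ),
]

# An in-memory buffer stands in for an output file opened with newline="".
buffer = io.StringIO()
write_tsv(pairs, buffer)
tsv_text = buffer.getvalue()
```

Reading the file back with csv.reader and the same delimiter, then calling json.loads on the words column, recovers the list for stopword removal.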