I have JSON data that looks like this (one object per line):

{
  "id": "c428c2e2-c30c-4864-8c12-458ead4b17f5",
  "weight": 73,
  "topics": {
    "type": 1,
    "values": [
      1,
      2,
      3
    ]
  }
}

When I read in the data without a specified schema, Spark infers topics.values to be an ArrayType but I need it to be a VectorUDT for doing ML tasks. So I am trying to read in the data set using a schema as follows:

    schema = StructType([
        StructField("id", StringType()),
        StructField("weight", IntegerType()),
        StructField("topics", StructType([
            StructField("type", IntegerType()),
            StructField("values", VectorUDT())
        ]))
    ])

When I do this, the data frame reports the following types (using dtypes):

[('id', 'string'), ('weight', 'int'), ('topics', 'struct<type:int,values:vector>')]

But there seems to be no actual data in the data frame, as shown by first():

Row(id=None, weight=None, topics=None)

And when I write the data frame to disk, I just see empty braces on each line. Seems odd! What am I doing wrong?

  • It is not odd. You are passing a schema that does not match the JSON document. Commented Aug 31, 2016 at 17:52
  • @LostInOverflow Can you elaborate? Obviously I am here asking a question because I don't know that. Commented Aug 31, 2016 at 18:00
  • @LostInOverflow Well, your comment did make me realize how to do this correctly. So thanks for that. Commented Aug 31, 2016 at 18:04
  • Glad it was helpful, and sorry I didn't have a more definitive suggestion. Commented Aug 31, 2016 at 18:40

1 Answer

Well, I figured it out:

I just needed to change the schema so that the whole topics object, rather than topics.values, is declared as the VectorUDT:

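    from pyspark.sql.types import StructType, StructField, StringType, DoubleType
    from pyspark.ml.linalg import VectorUDT  # pyspark.mllib.linalg on Spark 1.x

    # The whole "topics" object is declared as the vector, not just topics.values.
    schema = StructType([
        StructField("id", StringType()),
        StructField("weight", DoubleType()),
        StructField("topics", VectorUDT())
    ])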

Now it makes sense.
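
For anyone wondering why this works, here is my understanding (not an authoritative account of Spark's internals): VectorUDT is itself stored as a struct with the fields type, size, indices and values, where type 1 marks a dense vector and type 0 a sparse one. The topics object in my JSON already has that shape, so it has to be mapped to the vector as a whole. The sketch below is plain Python with no Spark involved; decode_vector is a made-up helper that only imitates that layout to show how one record would be interpreted.

```python
import json

# One line of the input file, as in the question.
line = ('{"id": "c428c2e2-c30c-4864-8c12-458ead4b17f5", "weight": 73, '
        '"topics": {"type": 1, "values": [1, 2, 3]}}')


def decode_vector(obj):
    """Hypothetical helper imitating the struct layout of a serialized vector.

    type 1 -> dense vector: only `values` is used.
    type 0 -> sparse vector: `size`, `indices` and `values` are used.
    Returns a plain list of floats as a stand-in for a Spark vector.
    """
    values = [float(v) for v in obj["values"]]
    if obj["type"] == 1:
        return values
    if obj["type"] == 0:
        dense = [0.0] * obj["size"]
        for i, v in zip(obj["indices"], values):
            dense[i] = v
        return dense
    raise ValueError("unknown vector type: %r" % obj["type"])


record = json.loads(line)
vector = decode_vector(record["topics"])
print(vector)  # [1.0, 2.0, 3.0]
```

With my original schema, the parser was asked to find a vector inside topics.values, which is just an array of numbers, so the record did not match and every column came back as null.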

1 Comment

Why VectorUDT? Can you explain the logic behind it? I checked the documentation here: spark.apache.org/docs/1.5.0/api/java/org/apache/spark/mllib/…
