
My issue is that I'm trying to read data from a sql.Row as a string. I'm using PySpark, but I've heard people hit this issue with the Scala API too.

The pyspark.sql.Row object is a pretty intransigent creature. The following exception is thrown:

java.lang.ClassCastException: [B cannot be cast to java.lang.String
 at org.apache.spark.sql.catalyst.expressions.GenericRow.getString(Row.scala:183)

So one of the fields is being represented as a byte array ([B is the JVM type tag for byte[]). The following Python printing constructs do NOT work:

repr(sqlRdd.take(2))

Also

import pprint
pprint.pprint(sqlRdd.take(2))

Both result in the ClassCastException.

So.. how do other folks handle this? I started to roll my own (can't copy/paste it here, unfortunately), but that feels like reinventing the wheel .. or so I suspect.

1 Answer


Try

sqlContext.setConf("spark.sql.parquet.binaryAsString", "true")

I think Spark 1.1.0 broke this: reading binary columns as strings used to work, then it stopped working, but they added this flag with its default set to false.
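
For reference, here's roughly how that fits into a Spark 1.x PySpark session (a sketch, not your exact setup: the parquet path is a placeholder, and parquetFile was the SQLContext read method at the time):

from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)  # sc: an existing SparkContext

# Must be set BEFORE the parquet file is read, or the binary
# columns will still come back as byte arrays.
sqlContext.setConf("spark.sql.parquet.binaryAsString", "true")

sqlRdd = sqlContext.parquetFile("/path/to/data.parquet")  # placeholder path
print(repr(sqlRdd.take(2)))  # should no longer raise the ClassCastException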

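If you've since moved to Spark 2.x or later, the same config key still exists; something like this should be the equivalent (spark here is a SparkSession, and the path is again a placeholder):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.parquet.binaryAsString", "true")
df = spark.read.parquet("/path/to/data.parquet")  # placeholder path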

1 Comment

nice! thx for coming back (well, later), getting out the backhoe to dig this up, and giving a good solution.
