
I'm using Cloudera's SparkOnHBase module to get data from HBase.

I get an RDD this way:

var getRdd = hbaseContext.hbaseRDD("kbdp:detalle_feedback", scan)

Based on that, what I get is an object of type

RDD[(Array[Byte], List[(Array[Byte], Array[Byte], Array[Byte])])]

which corresponds to the row key and a list of values, all represented as byte arrays.

If I save getRdd to a file, what I see is:

([B@f7e2590,[([B@22d418e2,[B@12adaf4b,[B@48cf6e81), ([B@2a5ffc7f,[B@3ba0b95,[B@2b4e651c), ([B@27d0277a,[B@52cfcf01,[B@491f7520), ([B@3042ad61,[B@6984d407,[B@f7c4db0), ([B@29d065c1,[B@30c87759,[B@39138d14), ([B@32933952,[B@5f98506e,[B@8c896ca), ([B@2923ac47,[B@65037e6a,[B@486094f5), ([B@3cd385f2,[B@62fef210,[B@4fc62b36), ([B@5b3f0f24,[B@8fb3349,[B@23e4023a), ([B@4e4e403e,[B@735bce9b,[B@10595d48), ([B@5afb2a5a,[B@1f99a960,[B@213eedd5), ([B@2a704c00,[B@328da9c4,[B@72849cc9), ([B@60518adb,[B@9736144,[B@75f6bc34)])

for each record (the row key and its columns).
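Those `[B@…` tokens are not garbage data: they are just the default `toString` of a Java byte array (the JVM class name `[B` plus an identity hash code), which is what gets written whenever an `Array[Byte]` is printed directly. A quick plain-Scala illustration:

```scala
val bytes: Array[Byte] = "hello".getBytes()

// Printing the array directly shows its class name and identity hash,
// e.g. [B@1b6d3586 -- not its contents.
println(bytes.toString)

// Building a String from the bytes recovers the text.
println(new String(bytes))  // hello
```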

But what I need is the String representation of each key and value (or at least the values), so that I can save it to a file and see something like

key1,(value1,value2...)

or something like

key1,value1,value2...

I'm completely new to Spark and Scala, and it's been quite hard to get anywhere.

Could you please help me with that?


2 Answers


First, let's create some sample data:

scala> val d = List( ("ab" -> List(("qw", "er", "ty")) ), ("cd" -> List(("ac", "bn", "afad")) ) )
d: List[(String, List[(String, String, String)])] = List((ab,List((qw,er,ty))), (cd,List((ac,bn,afad))))

This is what the data looks like:

scala> d foreach println
(ab,List((qw,er,ty)))
(cd,List((ac,bn,afad)))

Convert it to Array[Byte] format:

scala> val arrData = d.map { case (k,v) => k.getBytes() -> v.map { case (a,b,c) => (a.getBytes(), b.getBytes(), c.getBytes()) } }

arrData: List[(Array[Byte], List[(Array[Byte], Array[Byte], Array[Byte])])] = List((Array(97, 98),List((Array(113, 119),Array(101, 114),Array(116, 121)))), (Array(99, 100),List((Array(97, 99),Array(98, 110),Array(97, 102, 97, 100)))))

Create an RDD out of this data

scala> val rdd1 = sc.parallelize(arrData)
rdd1: org.apache.spark.rdd.RDD[(Array[Byte], List[(Array[Byte], Array[Byte], Array[Byte])])] = ParallelCollectionRDD[0] at parallelize at <console>:25

Create a conversion function from Array[Byte] to String:

scala> def b2s(a: Array[Byte]): String = new String(a)
b2s: (a: Array[Byte])String
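One caveat: `new String(a)` decodes with the platform's default charset. If the HBase cells may hold non-ASCII text, it is safer to pin the charset explicitly (UTF-8 here is only an assumption about how the data was written):

```scala
import java.nio.charset.StandardCharsets

// Decode with an explicit charset instead of the platform default.
def b2s(a: Array[Byte]): String = new String(a, StandardCharsets.UTF_8)

println(b2s("café".getBytes(StandardCharsets.UTF_8)))  // café
```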

Perform our final conversion:

scala> val rdd2 = rdd1.map { case (k,v) => b2s(k) -> v.map{ case (a,b,c) => (b2s(a), b2s(b), b2s(c)) } }
rdd2: org.apache.spark.rdd.RDD[(String, List[(String, String, String)])] = MapPartitionsRDD[1] at map at <console>:29

scala> rdd2.collect()
res2: Array[(String, List[(String, String, String)])] = Array((ab,List((qw,er,ty))), (cd,List((ac,bn,afad))))
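Since the original goal was a file with lines like `key1,value1,value2...`, one more `map` before saving can flatten each record into a comma-separated line. The flattening itself is plain Scala (shown on a tuple below), and the same function drops straight into `rdd2.map`; the output path is a placeholder:

```scala
// Flatten (key, List((v1, v2, v3), ...)) into "key,v1,v2,v3,..."
def toCsvLine(rec: (String, List[(String, String, String)])): String = {
  val (key, values) = rec
  (key :: values.flatMap { case (a, b, c) => List(a, b, c) }).mkString(",")
}

println(toCsvLine(("ab", List(("qw", "er", "ty")))))  // ab,qw,er,ty

// On the RDD (path is a placeholder):
// rdd2.map(toCsvLine).saveAsTextFile("/some/output/path")
```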

4 Comments

Thank you very much tuxdna. This is exactly what I needed. Very well explained, with all the steps. It seems pretty easy now that you've written the solution :)
Is there a way to go the other way round, converting the Strings in an RDD back to Array[Byte]?
@SRIRAMRAMACHANDRAN By default Spark uses the Java serializer. You could also use the Kryo serializer to avoid this manual transformation; for more details, spark.apache.org/docs/latest/tuning.html#data-serialization and stackoverflow.com/questions/37790946/… should help.
Can you help solve this problem? stackoverflow.com/questions/51089412/…
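As noted in the comments above, Kryo serialization can be enabled to avoid this kind of manual transformation for shuffles and caching. For reference, a sketch of the configuration change when the SparkConf is built (the app name is a placeholder; registering your own classes with Kryo is optional but recommended):

```scala
import org.apache.spark.SparkConf

// Sketch: switch Spark's serializer to Kryo.
val conf = new SparkConf()
  .setAppName("hbase-example")  // placeholder app name
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
```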

I don't know about HBase, but if those Array[Byte]s are Unicode strings, something like this should work:

rdd: RDD[(Array[Byte], List[(Array[Byte], Array[Byte], Array[Byte])])] = *whatever*
rdd.map { case (k, l) =>
  (new String(k),
   l.map { case (a, b, c) =>
     (new String(a), new String(b), new String(c))
   })
}

Sorry for the rough styling; I'm not even sure it will work.

4 Comments

Thank you very much mehmetminanc. It's not working exactly as written, but it gave me a good idea of how to approach the problem.
@tuxdna explained it very neatly, but I don't get how one works and the other does not. Both seem semantically the same.
Most probably, mehmetminanc. It's because of my inexperience that I understood the other approach better.
I wasn't trying to compete for the best answer; tuxdna's answer is better. I was just remarking that they are the same.
