pySpark Create DataFrame from RDD with Key/Value

Question

If I have an RDD of key/value pairs (the key being the column index), is it possible to load it into a DataFrame? For example:

(0,1), (0,1), (0,2), (1,2), (1,10), (1,20), (3,18), (3,18), (3,18)

And have the dataframe look like:

1,2,18
1,10,18
2,20,18

chrisaycock · Accepted Answer · 2017-03-07 16:37:47Z

11

Yes, it's possible (tested with Spark 1.3.1):

>>> rdd = sc.parallelize([(0,1), (0,1), (0,2), (1,2), (1,10), (1,20), (3,18), (3,18), (3,18)])
>>> sqlContext.createDataFrame(rdd, ["id", "score"])
Out[2]: DataFrame[id: bigint, score: bigint]

edited Mar 7, 2017 at 16:37

chrisaycock


answered May 2, 2015 at 20:43

Olivier Girardot



3 Comments

Frozen Flame Over a year ago

Is this equivalent to rdd.toDF(["id", "score"])?

Jack Daniel Over a year ago

I am getting this error: 'RDD' object has no attribute 'toDF'.

Jack Daniel Over a year ago

I am using Spark 1.6 with PySpark. I am unable to load sql.SQLContext and create a DataFrame from it.

S.I. · Accepted Answer · 2017-02-10 06:19:47Z

0

rdd = sc.parallelize([(0,1), (0,1), (0,2), (1,2), (1,10), (1,20), (3,18), (3,18), (3,18)])

df = rdd.toDF(['id', 'score'])

df.show()

The output is:

+---+-----+
| id|score|
+---+-----+
|  0|    1|
|  0|    1|
|  0|    2|
|  1|    2|
|  1|   10|
|  1|   20|
|  3|   18|
|  3|   18|
|  3|   18|
+---+-----+
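
Note that this gives a two-column (id, score) DataFrame, not the wide layout shown in the question, where each key becomes a column. Below is a minimal sketch of one way to get that layout. It is not part of the answer above and rests on some assumptions: every key has the same number of values, the order of the pairs is meaningful, and the data is small enough to collect to the driver. The helper name `pairs_to_rows` and the column names `c0`, `c1`, `c3` are hypothetical.

```python
def pairs_to_rows(pairs):
    """Group values by key (the column index), then zip the columns into rows."""
    columns = {}
    for key, value in pairs:
        columns.setdefault(key, []).append(value)

    keys = sorted(columns)

    # Assumption: all columns have equal length; otherwise zip would silently truncate.
    if len({len(columns[k]) for k in keys}) > 1:
        raise ValueError("every key must have the same number of values")

    rows = list(zip(*(columns[k] for k in keys)))
    return keys, rows


pairs = [(0, 1), (0, 1), (0, 2), (1, 2), (1, 10), (1, 20), (3, 18), (3, 18), (3, 18)]
keys, rows = pairs_to_rows(pairs)
print(keys)  # [0, 1, 3]
print(rows)  # [(1, 2, 18), (1, 10, 18), (2, 20, 18)]

# With Spark available (assuming `rdd` and `sqlContext` exist as in the answers above):
# keys, rows = pairs_to_rows(rdd.collect())
# df = sqlContext.createDataFrame(rows, ["c%d" % k for k in keys])
```

Collecting to the driver keeps the original pair order but does not scale. For a large RDD you would need a distributed approach with an explicit row index instead.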

edited Feb 10, 2017 at 6:19

S.I.


answered Feb 10, 2017 at 4:39

srinivasu

