
I have a mixed-type DataFrame that I am reading from a Hive table using spark.sql('select a, b, c from table').

Some columns are int, bigint, or double, and others are string. There are 32 columns in total. Is there any way in PySpark to convert all columns of the DataFrame to string type?

5 Answers


Just:

from pyspark.sql.functions import col

table = spark.table("table")

# Cast every column to string in a single projection
table.select([col(c).cast("string") for c in table.columns])

3 Comments

This method has a performance advantage over repeated withColumn calls when dealing with thousands of columns on version 2.1.0, since each withColumn adds another projection to the query plan.
@user7526416 How would you achieve the same if you wanted to do it on a Spark DataFrame, say df?
df = df.select([col(c).cast("string") for c in df.columns])

Here's a one-line solution in Scala (the : _* expands the mapped column array into select's varargs):

df.select(df.columns.map(c => col(c).cast(StringType)) : _*)

Let's see an example:

import org.apache.spark.sql._
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
val data = Seq(
   Row(1, "a"),
   Row(5, "z")
)

val schema = StructType(
  List(
    StructField("num", IntegerType, true),
    StructField("letter", StringType, true)
  )
)

val df = spark.createDataFrame(
  spark.sparkContext.parallelize(data),
  schema
)

df.printSchema
//root
//|-- num: integer (nullable = true)
//|-- letter: string (nullable = true)

val newDf = df.select(df.columns.map(c => col(c).cast(StringType)) : _*)

newDf.printSchema
//root
//|-- num: string (nullable = true)
//|-- letter: string (nullable = true)

I hope this helps.


from pyspark.sql.types import StringType

# Rebind df_data with each column cast to string, one withColumn call per column
for col_name in df_data.columns:
    df_data = df_data.withColumn(col_name, df_data[col_name].cast(StringType()))

1 Comment

Please don't post only code as answer, but also provide an explanation what your code does and how it solves the problem of the question. Answers with an explanation are usually more helpful and of better quality, and are more likely to attract upvotes.

For Scala, Spark version > 2.0:

case class Row(id: Int, value: Double)

import spark.implicits._

import org.apache.spark.sql.functions._

val r1 = Seq(Row(1, 1.0), Row(2, 2.0), Row(3, 3.0)).toDF()

r1.show
+---+-----+
| id|value|
+---+-----+
|  1|  1.0|
|  2|  2.0|
|  3|  3.0|
+---+-----+

val castedDF = r1.columns.foldLeft(r1)((current, c) => current.withColumn(c, col(c).cast("String")))

castedDF.printSchema
root
 |-- id: string (nullable = false)
 |-- value: string (nullable = false)
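
For reference, a rough PySpark equivalent of the same fold pattern (a sketch, assuming a DataFrame named r1 as above) uses functools.reduce:

from functools import reduce
from pyspark.sql.functions import col

# Fold over the column names, casting each one to string in turn
casted_df = reduce(
    lambda acc, c: acc.withColumn(c, col(c).cast("string")),
    r1.columns,
    r1,
)
casted_df.printSchema()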



You can cast a single column like this:

import pyspark.sql.functions as F
import pyspark.sql.types as T
df = df.withColumn("id", F.col("id").cast(T.StringType()))

and to cast all columns, apply the same pattern in a loop, as sketched below.
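
A minimal sketch of that loop (assuming df is the DataFrame from the question):

import pyspark.sql.functions as F
import pyspark.sql.types as T

# Overwrite each column with its string-cast version
for c in df.columns:
    df = df.withColumn(c, F.col(c).cast(T.StringType()))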

