
I have a mixed-type DataFrame that I am reading from a Hive table using spark.sql('select a, b, c from table').

Some columns are int, bigint, or double, and others are string. There are 32 columns in total. Is there any way in PySpark to convert all columns of the DataFrame to string type?

5 Answers


Just:

from pyspark.sql.functions import col

table = spark.table("table")

# Cast every column to string in a single projection
table.select([col(c).cast("string") for c in table.columns])

3 Comments

This method has a performance advantage over repeated withColumn calls when dealing with thousands of columns on version 2.1.0, since each withColumn adds another projection to the query plan.
@user7526416 How would you achieve the same if you wanted to do it on a Spark DataFrame, say df?
df = df.select([col(c).cast("string") for c in df.columns])

Here's a one-line solution in Scala (the : _* expands the mapped column array into select's varargs):

df.select(df.columns.map(c => col(c).cast(StringType)) : _*)

Let's see an example:

import org.apache.spark.sql._
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
val data = Seq(
   Row(1, "a"),
   Row(5, "z")
)

val schema = StructType(
  List(
    StructField("num", IntegerType, true),
    StructField("letter", StringType, true)
  )
)

val df = spark.createDataFrame(
  spark.sparkContext.parallelize(data),
  schema
)

df.printSchema
//root
//|-- num: integer (nullable = true)
//|-- letter: string (nullable = true)

val newDf = df.select(df.columns.map(c => col(c).cast(StringType)) : _*)

newDf.printSchema
//root
//|-- num: string (nullable = true)
//|-- letter: string (nullable = true)

I hope this helps.


from pyspark.sql.types import StringType

# Rebind df_data with each column cast to string, one withColumn call per column
for col_name in df_data.columns:
    df_data = df_data.withColumn(col_name, df_data[col_name].cast(StringType()))

1 Comment

Please don't post only code as answer, but also provide an explanation what your code does and how it solves the problem of the question. Answers with an explanation are usually more helpful and of better quality, and are more likely to attract upvotes.

For Scala, Spark version > 2.0:

case class Row(id: Int, value: Double)

import spark.implicits._

import org.apache.spark.sql.functions._

val r1 = Seq(Row(1, 1.0), Row(2, 2.0), Row(3, 3.0)).toDF()

r1.show
+---+-----+
| id|value|
+---+-----+
|  1|  1.0|
|  2|  2.0|
|  3|  3.0|
+---+-----+

val castedDF = r1.columns.foldLeft(r1)((current, c) => current.withColumn(c, col(c).cast("String")))

castedDF.printSchema
root
 |-- id: string (nullable = false)
 |-- value: string (nullable = false)
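
For reference, a rough PySpark equivalent of the same fold pattern (a sketch, assuming a DataFrame named r1 as above) uses functools.reduce:

from functools import reduce
from pyspark.sql.functions import col

# Fold over the column names, casting each one to string in turn
casted_df = reduce(
    lambda acc, c: acc.withColumn(c, col(c).cast("string")),
    r1.columns,
    r1,
)
casted_df.printSchema()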



You can cast a single column like this:

import pyspark.sql.functions as F
import pyspark.sql.types as T
df = df.withColumn("id", F.col("id").cast(T.StringType()))

and to cast all columns, apply the same pattern in a loop, as sketched below.
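
A minimal sketch of that loop (assuming df is the DataFrame from the question):

import pyspark.sql.functions as F
import pyspark.sql.types as T

# Overwrite each column with its string-cast version
for c in df.columns:
    df = df.withColumn(c, F.col(c).cast(T.StringType()))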

