I am trying to create a test data frame with one Int column and one String column, with output similar to the example below. I reckon for the Int column we could use:

# range's end is exclusive, so (1, 6) yields ids 1..5 as in the expected output
data = spark.range(1, 6)
output = data.withColumnRenamed('id', 'myid')

How do we deal with that string column? Many thanks for your help!

Expected output:

      id      ordernum
       1       0032
       2       0033
       3       0034
       4       0035
       5       0036

2 Answers

You can create a Spark dataframe from a list of lists. Here is an example:

data = [[i, '%04d' % (i+31)] for i in range(1,6)]
# [[1, '0032'], [2, '0033'], [3, '0034'], [4, '0035'], [5, '0036']]

df = spark.createDataFrame(data, ['id', 'ordernum'])
df.show()
+---+--------+
| id|ordernum|
+---+--------+
|  1|    0032|
|  2|    0033|
|  3|    0034|
|  4|    0035|
|  5|    0036|
+---+--------+
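
The second column ends up as a string because the '%04d' formatting already produces Python str values before Spark sees the data. Here is a minimal pure-Python sketch of that step (no Spark session needed); `make_rows` and its `offset` and `width` parameters are hypothetical names used only for illustration:

```python
def make_rows(start, stop, offset=31, width=4):
    """Build [id, ordernum] rows where ordernum is a zero-padded string."""
    # f"{n:04d}" is equivalent to '%04d' % n for non-negative integers
    return [[i, f"{i + offset:0{width}d}"] for i in range(start, stop)]

rows = make_rows(1, 6)
print(rows)
```

The resulting list can be passed to spark.createDataFrame(rows, ['id', 'ordernum']) as above. As far as I know, Spark infers Python int as bigint rather than int, so if a 32-bit Int column matters, passing an explicit schema such as 'id int, ordernum string' is probably the safer route.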

If you prefer Spark range, you can use format_string:

import pyspark.sql.functions as F
df = spark.range(1, 6).withColumn(
    'ordernum',
    F.format_string('%04d', F.col('id') + 31)
)

df.show()
+---+--------+
| id|ordernum|
+---+--------+
|  1|    0032|
|  2|    0033|
|  3|    0034|
|  4|    0035|
|  5|    0036|
+---+--------+

You can use the lpad function to build the ordernum column: take id + 31 and left-pad it with '0' to get a 4-character numeric string:

from pyspark.sql import functions as F

# use F.col, since only the functions module is imported (as F)
output = spark.range(1, 6).withColumn("ordernum", F.lpad(F.col("id") + 31, 4, '0'))

output.show()
#+---+--------+
#| id|ordernum|
#+---+--------+
#|  1|    0032|
#|  2|    0033|
#|  3|    0034|
#|  4|    0035|
#|  5|    0036|
#+---+--------+
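
One difference from the '%04d' style of formatting is worth keeping in mind: lpad pads to a fixed length and, per the Spark SQL documentation, shortens values that are longer than that length, whereas '%04d' only sets a minimum width. The sketch below is a rough pure-Python model of that behaviour, not Spark's implementation; `lpad_like` is a hypothetical helper and it assumes a non-empty pad string:

```python
def lpad_like(value, length, pad):
    """Rough model of SQL-style lpad: left-pad to `length`, truncate if longer."""
    s = str(value)
    if len(s) >= length:
        # Values that are already too long are cut to `length` characters
        return s[:length]
    needed = length - len(s)
    # Repeat the pad string and trim it to exactly the number of missing characters
    fill = (pad * needed)[:needed]
    return fill + s

print(lpad_like(32, 4, '0'))     # padded to four characters
print(lpad_like(12345, 4, '0'))  # truncated to four characters
print('%04d' % 12345)            # minimum width only, nothing is cut
```

For order numbers that never exceed four digits the two approaches give the same result; they only diverge once id + 31 reaches 10000.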
