Spark Scala - DataFrames & csv - partial extraction of values

Question

I have a CSV file called sampleOrder.csv which looks like this:

CarrierName,CarrierCustomerNumber,CarrierReference,CustomerReference,TransportDate,postcode,ProductDescription,ServiceDescription
DPD UK,260432,1.5503E+13,JO01974834,1/14/2013,LU7 4QT,PARCEL,NXTDAY
DPD UK,260364,1.55011E+13,C015800315,12/31/2012,BS3  5DH,PARCEL,NXTDAY
DPD UK,260268, 15501675752897R,953902,1/15/2013,CV10 7RL,REVERSE
IT,NXTDAY DPD UK,260162,1.55017E+13,C015889556,1/14/2013,IP13
6ET,PARCEL,  NXTDAY DPD UK,260364,1.55011E+13,C015939958,1/21/2013,SW6
7JY,PARCEL,  NXTDAY DPD
UK,260363,1.55012E+13,C015854701,1/10/2013,RG41 2AN,PARCEL,  NXTDAY
DPD UK,260364,1.55011E+13,C015945032,1/22/2013,RG5  4JB,PARCEL, 
NXTDAY DPD UK,260268,1.55017E+13,967819,1/11/2013, HD1 2QE,PARCEL, 
NXTDAY DPD UK,260364,1.55011E+13,C015966537,1/24/2013,ST1  6SL,HOME
DELIVERY,AFNOON DPD UK,260364,
15500557912288R,C015821652,1/4/2013,CV10 7RL,SWAPIT,NXTDAY

I have created a spark SQL context and I load the csv file into a dataframe like this:

val OrdersRAW = spark.read
                      .format("csv")
                      .option("header", "true")
                      .option("mode", "DROPMALFORMED")
                      .csv("Order_201301.csv")

I now want to load all the columns in the file and want extract only the first part of the postcode and populate it into another column. This is the part I am struggling with.

val ordersNew = OrdersRAW.select("CarrierName","CarrierCustomerNumber","CarrierReference","CustomerReference","TransportDate","postcode".substring(0,4).trim(),"ProductDescription","ServiceDescription")

Any idea as to how to achieve this ? Thanks in advance for your help. Am using Spark 2.0+

OrdersRAW.select(trim(substring($"postcode", 0 , 4 ) ).as("postcode")).show — philantrovert
– philantrovert, Commented Nov 9, 2017 at 12:37

philantrovert · Accepted Answer · 2017-11-09 12:57:19Z

There's no need of UDF. Both functions are internally available
You syntax is incorrect for substring. [Hint: Check the docs]
You can just use withColumn and replace column postcode instead of selecting the entire column list.

OrdersRAW.show

+-------------+---------------------+----------------+-----------------+-------------+----------+------------------+------------------+
|  CarrierName|CarrierCustomerNumber|CarrierReference|CustomerReference|TransportDate|  postcode|ProductDescription|ServiceDescription|
+-------------+---------------------+----------------+-----------------+-------------+----------+------------------+------------------+
|       DPD UK|               260432|      1.5503E+13|       JO01974834|    1/14/2013|   LU7 4QT|            PARCEL|            NXTDAY|
|       DPD UK|               260364|     1.55011E+13|       C015800315|   12/31/2012|  BS3  5DH|            PARCEL|            NXTDAY|
|          6ET|               PARCEL|   NXTDAY DPD UK|           260364|  1.55011E+13|C015939958|         1/21/2013|               SW6|
|           UK|               260363|     1.55012E+13|       C015854701|    1/10/2013|  RG41 2AN|            PARCEL|            NXTDAY|
|       DPD UK|               260364|     1.55011E+13|       C015945032|    1/22/2013|  RG5  4JB|            PARCEL|                  |
|NXTDAY DPD UK|               260268|     1.55017E+13|           967819|    1/11/2013|   HD1 2QE|            PARCEL|                  |
+-------------+---------------------+----------------+-----------------+-------------+----------+------------------+------------------+

val ordersNew = OrdersRAW.withColumn("postcode", trim(substring($"postcode", 0, 4) ) )

scala> ordersNew.show
+-------------+---------------------+----------------+-----------------+-------------+--------+------------------+------------------+
|  CarrierName|CarrierCustomerNumber|CarrierReference|CustomerReference|TransportDate|postcode|ProductDescription|ServiceDescription|
+-------------+---------------------+----------------+-----------------+-------------+--------+------------------+------------------+
|       DPD UK|               260432|      1.5503E+13|       JO01974834|    1/14/2013|     LU7|            PARCEL|            NXTDAY|
|       DPD UK|               260364|     1.55011E+13|       C015800315|   12/31/2012|     BS3|            PARCEL|            NXTDAY|
|          6ET|               PARCEL|   NXTDAY DPD UK|           260364|  1.55011E+13|    C015|         1/21/2013|               SW6|
|           UK|               260363|     1.55012E+13|       C015854701|    1/10/2013|    RG41|            PARCEL|            NXTDAY|
|       DPD UK|               260364|     1.55011E+13|       C015945032|    1/22/2013|     RG5|            PARCEL|                  |
|NXTDAY DPD UK|               260268|     1.55017E+13|           967819|    1/11/2013|     HD1|            PARCEL|                  |
+-------------+---------------------+----------------+-----------------+-------------+--------+------------------+------------------+

Thank you! This pretty much answers my question. This can now be closed.

Mehrez · Accepted Answer · 2017-11-09 13:34:20Z

0

you can use the spark UDF like this:

 import org.apache.spark.sql.functions._
 val postcodePrefix = udf((s: String) => s.substring(0,4).trim())
 OrdersRAW.withColumn("newColumnName", postcodePrefix(col("postcode")))

edited Nov 9, 2017 at 13:34

answered Nov 9, 2017 at 11:03

Mehrez

6958 silver badges14 bronze badges

3 Comments

philantrovert Over a year ago

There's no need of writing a UDF here. Both substring and trim are already available in Spark SQL.

eliasah Over a year ago

@nirali.gandhi please don't alter others answer like that! You'd rather comment and the author will see if it doesn't conflict with what he has already written.

Mehrez Over a year ago

yes sory for the code error in the udf function, sure the answer of @philantrovert is better

Collectives™ on Stack Overflow

Spark Scala - DataFrames & csv - partial extraction of values

2 Answers 2

2 Comments

3 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

2 Comments

3 Comments

Your Answer

Sign up or log in

Post as a guest

Related