0

I have a CSV file called sampleOrder.csv which looks like this:

CarrierName,CarrierCustomerNumber,CarrierReference,CustomerReference,TransportDate,postcode,ProductDescription,ServiceDescription
DPD UK,260432,1.5503E+13,JO01974834,1/14/2013,LU7 4QT,PARCEL,NXTDAY
DPD UK,260364,1.55011E+13,C015800315,12/31/2012,BS3  5DH,PARCEL,NXTDAY
DPD UK,260268, 15501675752897R,953902,1/15/2013,CV10 7RL,REVERSE
IT,NXTDAY DPD UK,260162,1.55017E+13,C015889556,1/14/2013,IP13
6ET,PARCEL,  NXTDAY DPD UK,260364,1.55011E+13,C015939958,1/21/2013,SW6
7JY,PARCEL,  NXTDAY DPD
UK,260363,1.55012E+13,C015854701,1/10/2013,RG41 2AN,PARCEL,  NXTDAY
DPD UK,260364,1.55011E+13,C015945032,1/22/2013,RG5  4JB,PARCEL, 
NXTDAY DPD UK,260268,1.55017E+13,967819,1/11/2013, HD1 2QE,PARCEL, 
NXTDAY DPD UK,260364,1.55011E+13,C015966537,1/24/2013,ST1  6SL,HOME
DELIVERY,AFNOON DPD UK,260364,
15500557912288R,C015821652,1/4/2013,CV10 7RL,SWAPIT,NXTDAY

I have created a spark SQL context and I load the csv file into a dataframe like this:

val OrdersRAW = spark.read
                      .format("csv")
                      .option("header", "true")
                      .option("mode", "DROPMALFORMED")
                      .csv("Order_201301.csv")

I now want to load all the columns in the file and want extract only the first part of the postcode and populate it into another column. This is the part I am struggling with.

val ordersNew = OrdersRAW.select("CarrierName","CarrierCustomerNumber","CarrierReference","CustomerReference","TransportDate","postcode".substring(0,4).trim(),"ProductDescription","ServiceDescription")

Any idea as to how to achieve this ? Thanks in advance for your help. Am using Spark 2.0+

3
  • 1
    OrdersRAW.select(trim(substring($"postcode", 0 , 4 ) ).as("postcode")).show Commented Nov 9, 2017 at 12:37
  • @philantrovert I suggest you post that as an answer Commented Nov 9, 2017 at 12:44
  • 1
    @eliasah alright, done. :) Commented Nov 9, 2017 at 12:58

2 Answers 2

1
  1. There's no need of UDF. Both functions are internally available
  2. You syntax is incorrect for substring. [Hint: Check the docs]
  3. You can just use withColumn and replace column postcode instead of selecting the entire column list.

OrdersRAW.show

+-------------+---------------------+----------------+-----------------+-------------+----------+------------------+------------------+
|  CarrierName|CarrierCustomerNumber|CarrierReference|CustomerReference|TransportDate|  postcode|ProductDescription|ServiceDescription|
+-------------+---------------------+----------------+-----------------+-------------+----------+------------------+------------------+
|       DPD UK|               260432|      1.5503E+13|       JO01974834|    1/14/2013|   LU7 4QT|            PARCEL|            NXTDAY|
|       DPD UK|               260364|     1.55011E+13|       C015800315|   12/31/2012|  BS3  5DH|            PARCEL|            NXTDAY|
|          6ET|               PARCEL|   NXTDAY DPD UK|           260364|  1.55011E+13|C015939958|         1/21/2013|               SW6|
|           UK|               260363|     1.55012E+13|       C015854701|    1/10/2013|  RG41 2AN|            PARCEL|            NXTDAY|
|       DPD UK|               260364|     1.55011E+13|       C015945032|    1/22/2013|  RG5  4JB|            PARCEL|                  |
|NXTDAY DPD UK|               260268|     1.55017E+13|           967819|    1/11/2013|   HD1 2QE|            PARCEL|                  |
+-------------+---------------------+----------------+-----------------+-------------+----------+------------------+------------------+

val ordersNew = OrdersRAW.withColumn("postcode", trim(substring($"postcode", 0, 4) ) )

scala> ordersNew.show
+-------------+---------------------+----------------+-----------------+-------------+--------+------------------+------------------+
|  CarrierName|CarrierCustomerNumber|CarrierReference|CustomerReference|TransportDate|postcode|ProductDescription|ServiceDescription|
+-------------+---------------------+----------------+-----------------+-------------+--------+------------------+------------------+
|       DPD UK|               260432|      1.5503E+13|       JO01974834|    1/14/2013|     LU7|            PARCEL|            NXTDAY|
|       DPD UK|               260364|     1.55011E+13|       C015800315|   12/31/2012|     BS3|            PARCEL|            NXTDAY|
|          6ET|               PARCEL|   NXTDAY DPD UK|           260364|  1.55011E+13|    C015|         1/21/2013|               SW6|
|           UK|               260363|     1.55012E+13|       C015854701|    1/10/2013|    RG41|            PARCEL|            NXTDAY|
|       DPD UK|               260364|     1.55011E+13|       C015945032|    1/22/2013|     RG5|            PARCEL|                  |
|NXTDAY DPD UK|               260268|     1.55017E+13|           967819|    1/11/2013|     HD1|            PARCEL|                  |
+-------------+---------------------+----------------+-----------------+-------------+--------+------------------+------------------+
Sign up to request clarification or add additional context in comments.

2 Comments

Thank you! This pretty much answers my question. This can now be closed.
Accept an answer and it'll get closed.
0

you can use the spark UDF like this:

 import org.apache.spark.sql.functions._
 val postcodePrefix = udf((s: String) => s.substring(0,4).trim())
 OrdersRAW.withColumn("newColumnName", postcodePrefix(col("postcode")))

3 Comments

There's no need of writing a UDF here. Both substring and trim are already available in Spark SQL.
@nirali.gandhi please don't alter others answer like that! You'd rather comment and the author will see if it doesn't conflict with what he has already written.
yes sory for the code error in the udf function, sure the answer of @philantrovert is better

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.