I have a CSV file called sampleOrder.csv which looks like this:
CarrierName,CarrierCustomerNumber,CarrierReference,CustomerReference,TransportDate,postcode,ProductDescription,ServiceDescription
DPD UK,260432,1.5503E+13,JO01974834,1/14/2013,LU7 4QT,PARCEL,NXTDAY
DPD UK,260364,1.55011E+13,C015800315,12/31/2012,BS3 5DH,PARCEL,NXTDAY
DPD UK,260268, 15501675752897R,953902,1/15/2013,CV10 7RL,REVERSE IT,NXTDAY
DPD UK,260162,1.55017E+13,C015889556,1/14/2013,IP13 6ET,PARCEL, NXTDAY
DPD UK,260364,1.55011E+13,C015939958,1/21/2013,SW6 7JY,PARCEL, NXTDAY
DPD UK,260363,1.55012E+13,C015854701,1/10/2013,RG41 2AN,PARCEL, NXTDAY
DPD UK,260364,1.55011E+13,C015945032,1/22/2013,RG5 4JB,PARCEL, NXTDAY
DPD UK,260268,1.55017E+13,967819,1/11/2013, HD1 2QE,PARCEL, NXTDAY
DPD UK,260364,1.55011E+13,C015966537,1/24/2013,ST1 6SL,HOME DELIVERY,AFNOON
DPD UK,260364, 15500557912288R,C015821652,1/4/2013,CV10 7RL,SWAPIT,NXTDAY
I have created a Spark SQL context and I load the CSV file into a DataFrame like this:
val OrdersRAW = spark.read
  .format("csv")
  .option("header", "true")
  .option("mode", "DROPMALFORMED")
  .csv("Order_201301.csv")
I now want to load all the columns in the file, but extract only the first part of the postcode and populate it into another column. This is the part I am struggling with.
val ordersNew = OrdersRAW.select("CarrierName","CarrierCustomerNumber","CarrierReference","CustomerReference","TransportDate","postcode".substring(0,4).trim(),"ProductDescription","ServiceDescription")
Any idea how to achieve this? Thanks in advance for your help. I am using Spark 2.0+.
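import org.apache.spark.sql.functions.{substring, trim} // $"..." also needs spark.implicits._ outside spark-shell
OrdersRAW.select(trim(substring($"postcode", 1, 4)).as("postcode")).show()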
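The select above returns only the postcode column, whereas the question asks to keep every column and put the prefix in a new one. A minimal sketch of one way to do that with withColumn follows. It assumes a small in-memory DataFrame standing in for OrdersRAW, and the column names postcodePrefix and outwardCode are made up for illustration. The second column shows a possible alternative that takes everything before the first whitespace, which may be more forgiving if the first part is not always within four characters.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, split, substring, trim}

// In spark-shell, getOrCreate() simply returns the existing session.
val spark = SparkSession.builder()
  .appName("postcode-prefix")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// A few postcodes from the question's sample, standing in for OrdersRAW.
val ordersRaw = Seq(
  ("DPD UK", "LU7 4QT"),
  ("DPD UK", "CV10 7RL"),
  ("DPD UK", " HD1 2QE")
).toDF("CarrierName", "postcode")

val ordersNew = ordersRaw
  // Same expression as in the answer, added as a new column (hypothetical
  // name) so that all existing columns are kept.
  .withColumn("postcodePrefix", trim(substring(col("postcode"), 1, 4)))
  // Alternative (hypothetical name): the text before the first run of whitespace.
  .withColumn("outwardCode", split(trim(col("postcode")), "\\s+").getItem(0))

ordersNew.show(false)
```

For these three rows both new columns come out as LU7, CV10 and HD1. Applied to the real data, the same two withColumn calls can be chained directly onto OrdersRAW.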