
I have a DataFrame that looks like this:

+--------------------+
|       unparsed_data|
+--------------------+
|02020sometext5002...
|02020sometext6682...

I need to split it up into something like this:

+--------------------
|fips  | Name     | Id ...    
+--------------------
|02020 | sometext | 5002...
|02020 | sometext | 6682...

I have a list like this:

val fields = List(
  ("fips", 5),
  ("Name", 8),
  ("Id", 27),
  // ...more fields
)

I need the split to take the first 5 characters of unparsed_data and map them to fips, take the next 8 characters and map them to Name, then the next 27 characters and map them to Id, and so on. The split needs to use/reference the field lengths supplied in the list to do the splitting/slicing, as there are a lot of fields and the unparsed_data field is very long.
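For intuition, the start offsets implied by those lengths can be precomputed with scanLeft (a plain-Scala sketch, independent of Spark; the names mirror the list above):

```scala
val fields = List(("fips", 5), ("Name", 8), ("Id", 27))

// running sum of the lengths: each field's 1-based start position in the string
val starts = fields.scanLeft(1)(_ + _._2).init

// pair each (name, length) with its computed start position
val spec = fields.zip(starts).map { case ((name, len), start) => (name, start, len) }
// spec: List((fips,1,5), (Name,6,8), (Id,14,27))
```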

My Scala is still pretty weak, and I assume the answer would look something like this:

df.withColumn("temp_field", split("unparsed_data", //some regex created from the list values?)).map(i => //some mapping to the field names in the list)

Any suggestions/ideas much appreciated.

2 Answers


You can use foldLeft to traverse your fields list, iteratively creating columns from the original DataFrame with substring. This works regardless of the size of the fields list:

import org.apache.spark.sql.functions._
import spark.implicits._  // assumes a SparkSession named `spark` is in scope (e.g. in spark-shell)

val df = Seq(
  ("02020sometext5002"),
  ("03030othrtext6003"),
  ("04040moretext7004")
).toDF("unparsed_data")

val fields = List(
  ("fips", 5),
  ("name", 8),
  ("id", 4)
)

val resultDF = fields.foldLeft((df, 1)) { case ((accDF, pos), (name, len)) =>
    (accDF.withColumn(name, substring($"unparsed_data", pos, len)), pos + len)
  }._1.
  drop("unparsed_data")

resultDF.show
// +-----+--------+----+
// | fips|    name|  id|
// +-----+--------+----+
// |02020|sometext|5002|
// |03030|othrtext|6003|
// |04040|moretext|7004|
// +-----+--------+----+

Note that a Tuple2[DataFrame, Int] is used as the foldLeft accumulator to carry both the iteratively transformed DataFrame and the next offset position for substring.
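The same accumulator pattern can be seen on a plain string, outside Spark (a small sketch for intuition only; it is not needed for the DataFrame version):

```scala
val fields = List(("fips", 5), ("name", 8), ("id", 4))
val row = "02020sometext5002"

// carry (parsed-so-far, current offset) through the fold, exactly as above
val parsed = fields.foldLeft((Map.empty[String, String], 0)) {
  case ((acc, pos), (name, len)) =>
    (acc + (name -> row.substring(pos, pos + len)), pos + len)
}._1
// parsed: Map(fips -> 02020, name -> sometext, id -> 5002)
```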


1 Comment

I have to think about this. I knew you would come with the fold, and that it was somehow possible. I get the impression that Scala allows virtually everything.

This can get you going. Depending on your needs it can get more and more complicated, e.g. with variable lengths, which you do not state. But I think you can use a column list.

import org.apache.spark.sql.functions._

val df = Seq(
  ("12334sometext999")
).toDF("X")

val df2 = df.selectExpr("substring(X, 0, 5)", "substring(X, 6, 8)", "substring(X, 14, 3)")

df2.show

In this case it gives (you can rename the columns afterwards):

+------------------+------------------+-------------------+
|substring(X, 0, 5)|substring(X, 6, 8)|substring(X, 14, 3)|
+------------------+------------------+-------------------+
|             12334|          sometext|                999|
+------------------+------------------+-------------------+
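If the hard-coded positions bother you, the selectExpr strings could be generated from the fields list instead (a sketch; the alias names come from the list, and `X` is the column name used above):

```scala
val fields = List(("fips", 5), ("name", 8), ("id", 4))

// thread the 1-based start position through scanLeft while building each expression
val exprs = fields.scanLeft(("", 1)) { case ((_, pos), (name, len)) =>
  (s"substring(X, $pos, $len) as $name", pos + len)
}.tail.map(_._1)
// exprs: List(substring(X, 1, 5) as fips, substring(X, 6, 8) as name, substring(X, 14, 4) as id)

// then: df.selectExpr(exprs: _*)
```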
