
I have a DataFrame that looks like this:

+--------------------+
|       unparsed_data|
+--------------------+
|02020sometext5002...
|02020sometext6682...

I need to split it up into something like this:

+--------------------
|fips  | Name     | Id ...    
+--------------------
|02020 | sometext | 5002...
|02020 | sometext | 6682...

I have a list like this:

val fields = List(
  ("fips", 5),
  ("Name", 8),
  ("Id", 27),
  // ...more fields
)

I need the split to take the first 5 characters of unparsed_data and map them to fips, take the next 8 characters and map them to Name, then the next 27 characters and map them to Id, and so on. The split needs to use/reference the field lengths supplied in the list to do the splitting/slicing, as there are a lot of fields and the unparsed_data field is very long.
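For intuition, the start offsets implied by those lengths can be precomputed with scanLeft (a plain-Scala sketch, independent of Spark; the names mirror the list above):

```scala
val fields = List(("fips", 5), ("Name", 8), ("Id", 27))

// running sum of the lengths: each field's 1-based start position in the string
val starts = fields.scanLeft(1)(_ + _._2).init

// pair each (name, length) with its computed start position
val spec = fields.zip(starts).map { case ((name, len), start) => (name, start, len) }
// spec: List((fips,1,5), (Name,6,8), (Id,14,27))
```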

My Scala is still pretty weak, and I assume the answer would look something like this:

df.withColumn("temp_field", split("unparsed_data", //some regex created from the list values?)).map(i => //some mapping to the field names in the list)

Any suggestions/ideas much appreciated.

2 Answers


You can use foldLeft to traverse your fields list, iteratively creating columns from the original DataFrame with substring. This works regardless of the size of the fields list:

import org.apache.spark.sql.functions._
import spark.implicits._  // assumes a SparkSession named `spark` is in scope (e.g. in spark-shell)

val df = Seq(
  ("02020sometext5002"),
  ("03030othrtext6003"),
  ("04040moretext7004")
).toDF("unparsed_data")

val fields = List(
  ("fips", 5),
  ("name", 8),
  ("id", 4)
)

val resultDF = fields.foldLeft((df, 1)) { case ((accDF, pos), (name, len)) =>
    (accDF.withColumn(name, substring($"unparsed_data", pos, len)), pos + len)
  }._1.
  drop("unparsed_data")

resultDF.show
// +-----+--------+----+
// | fips|    name|  id|
// +-----+--------+----+
// |02020|sometext|5002|
// |03030|othrtext|6003|
// |04040|moretext|7004|
// +-----+--------+----+

Note that a Tuple2[DataFrame, Int] is used as the foldLeft accumulator to carry both the iteratively transformed DataFrame and the next offset position for substring.
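The same accumulator pattern can be seen on a plain string, outside Spark (a small sketch for intuition only; it is not needed for the DataFrame version):

```scala
val fields = List(("fips", 5), ("name", 8), ("id", 4))
val row = "02020sometext5002"

// carry (parsed-so-far, current offset) through the fold, exactly as above
val parsed = fields.foldLeft((Map.empty[String, String], 0)) {
  case ((acc, pos), (name, len)) =>
    (acc + (name -> row.substring(pos, pos + len)), pos + len)
}._1
// parsed: Map(fips -> 02020, name -> sometext, id -> 5002)
```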


1 Comment

I have to think about this. I knew you would come with the fold, and that it was somehow possible. I get the impression that Scala allows virtually everything.

This can get you going. Depending on your needs it can get more and more complicated, e.g. with variable lengths, which you do not state. But I think you can use a column list.

import org.apache.spark.sql.functions._

val df = Seq(
  ("12334sometext999")
).toDF("X")

val df2 = df.selectExpr("substring(X, 0, 5)", "substring(X, 6, 8)", "substring(X, 14, 3)")

df2.show

In this case it gives (you can rename the columns afterwards):

+------------------+------------------+-------------------+
|substring(X, 0, 5)|substring(X, 6, 8)|substring(X, 14, 3)|
+------------------+------------------+-------------------+
|             12334|          sometext|                999|
+------------------+------------------+-------------------+
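If the hard-coded positions bother you, the selectExpr strings could be generated from the fields list instead (a sketch; the alias names come from the list, and `X` is the column name used above):

```scala
val fields = List(("fips", 5), ("name", 8), ("id", 4))

// thread the 1-based start position through scanLeft while building each expression
val exprs = fields.scanLeft(("", 1)) { case ((_, pos), (name, len)) =>
  (s"substring(X, $pos, $len) as $name", pos + len)
}.tail.map(_._1)
// exprs: List(substring(X, 1, 5) as fips, substring(X, 6, 8) as name, substring(X, 14, 4) as id)

// then: df.selectExpr(exprs: _*)
```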
