
I have a spark dataframe with the following column structure:

UT_LVL_17_CD, UT_LVL_20_CD, 2017 1Q, 2017 2Q, 2017 3Q, 2017 4Q, 2017 FY, 2018 1Q, 2018 2Q, 2018 3Q, 2018 4Q, 2018 FY

In the above column structure, I will get new columns for subsequent quarters, like 2019 1Q, 2019 2Q, etc.

I want to select UT_LVL_17_CD, UT_LVL_20_CD, and the columns that match the pattern year<space>quarter, like 2017 1Q. Basically, I want to avoid selecting columns like 2017 FY and 2018 FY, and this has to be dynamic, as I will get new FY data each year.

I am using spark 2.4.4


3 Answers


Like I stated in my comment, this can be done in plain Scala using a Regex, since the DataFrame can return its column names as an Array[String]:

scala> val columns = df.columns
// columns: Array[String] = Array(UT_LVL_17_CD, UT_LVL_20_CD, 2017 1Q, 2017 2Q, 2017 3Q, 2017 4Q, 2017 FY, 2018 1Q, 2018 2Q, 2018 3Q, 2018 4Q, 2018 FY)

scala> val regex = """^((?!FY).)*$""".r
// regex: scala.util.matching.Regex = ^((?!FY).)*$

scala> val selection = columns.filter(s => regex.findFirstIn(s).isDefined)
// selection: Array[String] = Array(UT_LVL_17_CD, UT_LVL_20_CD, 2017 1Q, 2017 2Q, 2017 3Q, 2017 4Q, 2018 1Q, 2018 2Q, 2018 3Q, 2018 4Q)

You can check that the selection does not contain the unwanted columns:

scala> columns.diff(selection)
// res2: Array[String] = Array(2017 FY, 2018 FY)

Now you can use the selection:

scala> df.select(selection.head, selection.tail : _*)
// res3: org.apache.spark.sql.DataFrame = [UT_LVL_17_CD: int, UT_LVL_20_CD: int ... 8 more fields]
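If you prefer a positive match over the negative lookahead, you can match the year<space>quarter shape directly and keep the two key columns by name. Here's a minimal plain-Scala sketch, with the column list hard-coded to stand in for df.columns:

```scala
// Sketch: match "YYYY nQ" columns explicitly and keep the two key
// columns by name, rather than excluding anything containing "FY".
val quarterPattern = """^\d{4} [1-4]Q$""".r
val keyCols = Seq("UT_LVL_17_CD", "UT_LVL_20_CD")
val columns = Seq("UT_LVL_17_CD", "UT_LVL_20_CD",
  "2017 1Q", "2017 FY", "2018 2Q", "2018 FY")
val selection = columns.filter(c =>
  keyCols.contains(c) || quarterPattern.findFirstIn(c).isDefined)
// selection: Seq(UT_LVL_17_CD, UT_LVL_20_CD, 2017 1Q, 2018 2Q)
```

New columns such as 2019 1Q will match the pattern automatically, while any future FY column is never selected, so no list needs updating each year.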



You could use the desc SQL command to get the list of column names:

    import java.util

    val fyStringList = new util.ArrayList[String]()
    spark.sql("desc <table_name>")
      .select("col_name")
      .filter(row => row.getString(0).toLowerCase.contains("fy"))
      .collect
      .foreach(row => fyStringList.add(row.getString(0)))
    println(fyStringList)

Use the above snippet to get the list of column names which contain "fy". You can update the filter logic with a regex, and also update the logic in foreach for storing the string columns.
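For reference, the same filter can be written without the mutable java.util.ArrayList. This sketch applies the "fy" check to a plain list of names, standing in for the col_name values returned by desc:

```scala
// Sketch: filter column names containing "fy" (case-insensitive)
// into an immutable Scala Seq instead of a java.util.ArrayList.
val colNames = Seq("UT_LVL_17_CD", "UT_LVL_20_CD", "2017 1Q", "2017 FY", "2018 FY")
val fyColumns = colNames.filter(_.toLowerCase.contains("fy"))
// fyColumns: Seq(2017 FY, 2018 FY)
```

The immutable version avoids the side-effecting foreach and is more idiomatic Scala.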



You can try this snippet, assuming DF is your dataframe which consists of all those columns.

val DF1 = DF.select(DF.columns.filter(x => !x.contains("FY")).map(DF(_)) : _*)

This will remove those FY-related columns. Hope this works for you.

1 Comment

I was using your answer and it was working fine. Below is my command: select(final_df.columns.filter(x => x.contains("split")).map(final_df(_)) : _*). In this I'm taking all the columns that contain "split" in the column name. What if I want to include a few more columns in the output? Could you please let me know how to achieve this?
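One way to handle the case from the comment (a name pattern plus a few extra named columns) is to OR the two conditions inside the filter. A minimal sketch, where the column names and the extras list are hypothetical:

```scala
// Sketch: keep columns whose name contains "split", plus an explicit
// list of extra columns to always include (names are hypothetical).
val extras = Seq("id", "date")
val allCols = Seq("id", "date", "a_split_1", "b_split_2", "other")
val selected = allCols.filter(c => c.contains("split") || extras.contains(c))
// then: final_df.select(selected.map(final_df(_)) : _*)
```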
