
I have a spark dataframe with the following column structure:

UT_LVL_17_CD, UT_LVL_20_CD, 2017 1Q, 2017 2Q, 2017 3Q, 2017 4Q, 2017 FY, 2018 1Q, 2018 2Q, 2018 3Q, 2018 4Q, 2018 FY

In the above column structure, I will get new columns for subsequent quarters, like 2019 1Q, 2019 2Q, etc.

I want to select UT_LVL_17_CD, UT_LVL_20_CD, and the columns that match the pattern year<space>quarter, like 2017 1Q. Basically, I want to avoid selecting columns like 2017 FY and 2018 FY, and this has to be dynamic, as I will get new FY data each year.

I am using spark 2.4.4


3 Answers


Like I stated in my comment, this can be done in plain Scala using a Regex, since the DataFrame can return its column names as an Array[String]:

scala> val columns = df.columns
// columns: Array[String] = Array(UT_LVL_17_CD, UT_LVL_20_CD, 2017 1Q, 2017 2Q, 2017 3Q, 2017 4Q, 2017 FY, 2018 1Q, 2018 2Q, 2018 3Q, 2018 4Q, 2018 FY)

scala> val regex = """^((?!FY).)*$""".r
// regex: scala.util.matching.Regex = ^((?!FY).)*$

scala> val selection = columns.filter(s => regex.findFirstIn(s).isDefined)
// selection: Array[String] = Array(UT_LVL_17_CD, UT_LVL_20_CD, 2017 1Q, 2017 2Q, 2017 3Q, 2017 4Q, 2018 1Q, 2018 2Q, 2018 3Q, 2018 4Q)

You can check that the selection does not contain the unwanted columns:

scala> columns.diff(selection)
// res2: Array[String] = Array(2017 FY, 2018 FY)

Now you can use the selection:

scala> df.select(selection.head, selection.tail : _*)
// res3: org.apache.spark.sql.DataFrame = [UT_LVL_17_CD: int, UT_LVL_20_CD: int ... 8 more fields]
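If you prefer a positive match over the negative lookahead, you can match the year<space>quarter shape directly and keep the two key columns by name. Here's a minimal plain-Scala sketch, with the column list hard-coded to stand in for df.columns:

```scala
// Sketch: match "YYYY nQ" columns explicitly and keep the two key
// columns by name, rather than excluding anything containing "FY".
val quarterPattern = """^\d{4} [1-4]Q$""".r
val keyCols = Seq("UT_LVL_17_CD", "UT_LVL_20_CD")
val columns = Seq("UT_LVL_17_CD", "UT_LVL_20_CD",
  "2017 1Q", "2017 FY", "2018 2Q", "2018 FY")
val selection = columns.filter(c =>
  keyCols.contains(c) || quarterPattern.findFirstIn(c).isDefined)
// selection: Seq(UT_LVL_17_CD, UT_LVL_20_CD, 2017 1Q, 2018 2Q)
```

New columns such as 2019 1Q will match the pattern automatically, while any future FY column is never selected, so no list needs updating each year.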



You could use the desc SQL command to get the list of column names:

    import java.util

    val fyStringList = new util.ArrayList[String]()
    spark.sql("desc <table_name>")
      .select("col_name")
      .filter(row => row.getString(0).toLowerCase.contains("fy"))
      .collect
      .foreach(row => fyStringList.add(row.getString(0)))
    println(fyStringList)

Use the above snippet to get the list of column names which contain "fy". You can update the filter logic with a regex, and also update the logic in foreach for storing the string columns.
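For reference, the same filter can be written without the mutable java.util.ArrayList. This sketch applies the "fy" check to a plain list of names, standing in for the col_name values returned by desc:

```scala
// Sketch: filter column names containing "fy" (case-insensitive)
// into an immutable Scala Seq instead of a java.util.ArrayList.
val colNames = Seq("UT_LVL_17_CD", "UT_LVL_20_CD", "2017 1Q", "2017 FY", "2018 FY")
val fyColumns = colNames.filter(_.toLowerCase.contains("fy"))
// fyColumns: Seq(2017 FY, 2018 FY)
```

The immutable version avoids the side-effecting foreach and is more idiomatic Scala.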



You can try this snippet, assuming DF is your dataframe which consists of all those columns.

val DF1 = DF.select(DF.columns.filter(x => !x.contains("FY")).map(DF(_)) : _*)

This will remove those FY-related columns. Hope this works for you.

1 Comment

I was using your answer and it was working fine. Below is my command: select(final_df.columns.filter(x => x.contains("split")).map(final_df(_)) : _*). In this I'm taking all the columns that contain "split" in the column name. What if I want to include a few more columns in the output? Could you please let me know how to achieve this?
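One way to handle the case from the comment (a name pattern plus a few extra named columns) is to OR the two conditions inside the filter. A minimal sketch, where the column names and the extras list are hypothetical:

```scala
// Sketch: keep columns whose name contains "split", plus an explicit
// list of extra columns to always include (names are hypothetical).
val extras = Seq("id", "date")
val allCols = Seq("id", "date", "a_split_1", "b_split_2", "other")
val selected = allCols.filter(c => c.contains("split") || extras.contains(c))
// then: final_df.select(selected.map(final_df(_)) : _*)
```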
