
I'm using spark-sql 2.4.1 with Java 8.

I have a dynamic list of columns that is passed into my function.

i.e.

List<String> cols = Arrays.asList("col_1","col_2","col_3","col_4");
Dataset<Row> df = ...; // has the above columns plus "id" and "name", plus many other columns

I need to select cols plus "id" and "name".

I am doing it as below:

Dataset<Row> res_df = df.select("id", "name", cols.stream().toArray(String[]::new));

This gives a compilation error, so how do I handle this use case?
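For context, Dataset.select(String col, String... cols) is a varargs method: Java lets you pass either separate strings or a single String[] for the varargs slot, but not a mix of both, which is why the call above does not compile. A minimal sketch of one workaround, assuming the df and cols above, merges the extra names into one array first:

import java.util.stream.Stream;

// Merge the fixed column name with the dynamic ones into a single String[]
// so the array alone fills the varargs parameter of select(String, String...).
String[] rest = Stream.concat(Stream.of("name"), cols.stream())
        .toArray(String[]::new);
Dataset<Row> res_df = df.select("id", rest);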

What I tried:

When I do something like the following:

List<String> cols = new ArrayList<>(Arrays.asList("col_1","col_2","col_3","col_4"));
cols.add("id");
cols.add("name");

I get this error:

Exception in thread "main" java.lang.UnsupportedOperationException
    at java.util.AbstractList.add(AbstractList.java:148)
    at java.util.AbstractList.add(AbstractList.java:108)
1 Comment

You get the UnsupportedOperationException because the actual type of the List you're using is the fixed-size list returned by Arrays.asList (java.util.Arrays$ArrayList), not java.util.ArrayList.
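To illustrate that comment, a minimal plain-Java sketch (no Spark involved):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

List<String> fixed = Arrays.asList("col_1", "col_2");
// fixed.add("id"); // would throw UnsupportedOperationException:
//                  // Arrays.asList returns a fixed-size view over the array

List<String> growable = new ArrayList<>(fixed); // copy into a real ArrayList
growable.add("id");   // fine
growable.add("name"); // fine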

2 Answers


You could create an array of Columns and pass it to the select statement.

import org.apache.spark.sql.*;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Start from a mutable copy so add() works (see the comment above).
List<String> cols = new ArrayList<>(Arrays.asList("col_1", "col_2", "col_3", "col_4"));
cols.add("id");
cols.add("name");

// Map each name to a Column and collect into a Column[],
// which matches the select(Column... cols) overload.
Column[] cols2 = cols.stream()
        .map(Column::new)
        .toArray(Column[]::new);

df.select(cols2).show();
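As a side note, if you only need plain column names, selectExpr(String... exprs) takes strings directly, so the mapping to Column can be skipped entirely (a sketch, same df and cols as above):

// selectExpr accepts SQL expression strings; bare column names qualify.
df.selectExpr(cols.toArray(new String[0])).show();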

1 Comment

Try changing List to ArrayList<String> cols, and add these imports: import java.util.ArrayList; import java.util.Arrays; import java.util.List;

There are several ways to achieve this, relying on the different select method signatures.

One possible solution, under the assumption that the cols list is immutable and not controlled by your code:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import scala.collection.JavaConverters;

public class ATest {
    public static void main(String[] args) {
        SparkSession spark = SparkSession
                .builder()
                .appName("Java Spark SQL basic example")
                .master("local[2]")
                .getOrCreate();

        List<String> cols = Arrays.asList("col_1", "col_2");

        Dataset<Row> df = spark.sql(
                "select 42 as ID, 'John' as NAME, 1 as col_1, 2 as col_2, 3 as col_3, 4 as col_4");
        df.show();

        // Prepend the fixed name(s) to the dynamic list in a mutable copy.
        ArrayList<String> newCols = new ArrayList<>();
        newCols.add("NAME");
        newCols.addAll(cols);

        // select(String, Seq<String>) is the Scala varargs signature, so the
        // Java collection must be converted to a scala.collection.Seq first.
        df.select("ID", JavaConverters.asScalaIteratorConverter(newCols.iterator()).asScala().toSeq())
                .show();
    }
}
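For what it's worth, the iterator-based conversion can also be written with asScalaBufferConverter, which converts the List directly (a sketch under the same assumptions as the example above):

import scala.collection.JavaConverters;
import scala.collection.Seq;

// Wraps the java.util.List as a Scala Buffer; toSeq() produces the Seq
// that the Scala varargs signature select(String, Seq<String>) expects.
Seq<String> scalaCols = JavaConverters.asScalaBufferConverter(newCols).asScala().toSeq();
df.select("ID", scalaCols).show();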

1 Comment

@BdEngineer I've updated the post with a full working example.
