Since Spark runs in distributed mode, you cannot assign values from an array to a column by index. Suppose Spark runs with two workers, and the rows for John and Elizabeth are delivered to worker A while Eric's row is delivered to worker B. The rows are split across partitions when they are stored in a DataFrame, so the workers do not know the index of John, Elizabeth, or Eric. You could do what you want easily in an ordinary single-process Java program; in Spark you have to make the index explicit (see the zipWithIndex sketch at the end of this answer).
In your example, you would need to convert your array to a DataFrame and use join to merge the two DataFrames on a column with matching values. Alternatively, you can use crossJoin to take the cartesian product of the two tables:
Dataset<Row> ndf = df.crossJoin(df2);
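For the join variant, a minimal sketch (the name-to-city pairs below are invented for illustration; in practice your data needs a real shared key column, and this assumes imports of org.apache.spark.sql.types.DataTypes and org.apache.spark.sql.types.StructType on top of those in the full example below):

StructType citySchema = new StructType()
        .add("name", DataTypes.StringType)
        .add("city", DataTypes.StringType);
List<Row> cityRows = Arrays.asList(
        RowFactory.create("Michael", "LA"),
        RowFactory.create("Andy", "AZ"),
        RowFactory.create("Justin", "OH"));
// Build the second DataFrame and merge it with df on the shared "name" column
Dataset<Row> df2 = spark.createDataFrame(cityRows, citySchema);
Dataset<Row> joined = df.join(df2, "name");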
If you just need to add a column with a constant value, or a value derived from another column of the same DataFrame, use withColumn as below:
Dataset<Row> ndf = df.withColumn("city", functions.lit(1));        // constant value
Dataset<Row> ndf = df.withColumn("city", functions.rand());        // random value
Dataset<Row> ndf = df.withColumn("city", functions.col("name"));   // copy of another column
Finally, you can use an AtomicInteger like this to get what you want. I have only tested it in Spark local (single-JVM) mode; on a real cluster the counter would not be shared across executors, so the row-to-index pairing is not guaranteed.
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.catalyst.encoders.RowEncoder;
import org.apache.spark.sql.functions;

public static void main(String[] args) {
    System.setProperty("hadoop.home.dir", "H:\\work\\HadoopWinUtils\\");
    SparkSession spark = SparkSession
            .builder()
            .master("local[*]")
            .getOrCreate();
    Dataset<Row> df = spark.read().json("H:\\work\\HadoopWinUtils\\people.json");
    List<String> city_array = Arrays.asList("LA", "AZ", "OH");
    // Display the content of the DataFrame to stdout
    df.show();
    // Add a placeholder "city" column so df.schema() already has the three
    // fields the row encoder below expects
    df = df.withColumn("city", functions.col("name"));
    // The counter is only consistent because local mode runs in a single JVM
    AtomicInteger i = new AtomicInteger();
    Dataset<Row> df3 = df.map((MapFunction<Row, Row>) value ->
            RowFactory.create(value.get(0), value.get(1),
                    city_array.get(i.getAndIncrement())),
            RowEncoder.apply(df.schema()));
    df3.show();
}
The input people.json is
+----+-------+
| age| name|
+----+-------+
|null|Michael|
| 30| Andy|
| 19| Justin|
+----+-------+
and the result is
+----+-------+----+
| age| name|city|
+----+-------+----+
|null|Michael| LA|
| 30| Andy| AZ|
| 19| Justin| OH|
+----+-------+----+
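If you need the positional pairing to hold on a real cluster rather than only in local mode, a safer route (not used above, so treat this as a sketch) is to let Spark assign an explicit index with zipWithIndex on the underlying RDD and look the array element up by that index. This assumes df is the two-column DataFrame just read from people.json, city_array is the list from the full example, and imports of org.apache.spark.api.java.JavaRDD, org.apache.spark.sql.types.DataTypes and org.apache.spark.sql.types.StructType:

// Attach a stable, Spark-assigned index to every row, then use it to pick
// the matching array element; unlike the AtomicInteger counter, these
// indices stay correct across partitions.
StructType withCity = df.schema().add("city", DataTypes.StringType);
JavaRDD<Row> indexed = df.toJavaRDD()
        .zipWithIndex()
        .map(t -> RowFactory.create(t._1().get(0), t._1().get(1),
                city_array.get(t._2().intValue())));
Dataset<Row> df4 = spark.createDataFrame(indexed, withCity);
df4.show();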