
Can anyone tell me how to convert a Spark DataFrame into an Array[String] in Scala?

I have used the following:

val x = df.select(columns.head, columns.tail: _*).collect()

The above snippet gives me an Array[Row], not an Array[String].

  • .map { row => row.toString() } ?? Commented Sep 9, 2017 at 20:44
  • Thank you for the response; that does the work. Can you tell me if using .map { row => row.toString().mkString(",") } would eliminate the brackets "[" and "]" when we print it out? Commented Sep 9, 2017 at 21:17
  • mkString makes a string from an array... This is all Scala knowledge. How would you make an array a string in Java or Python? My point is, your question/problem is completely outside of Spark. Commented Sep 9, 2017 at 21:49
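
As the comment thread notes, Row.toString() wraps the fields in brackets; what actually drops them is Row's own mkString, not mkString called on the resulting String. A minimal sketch, assuming the df and columns from the question:

    // Row.toString() yields strings like "[a,b,c]".
    val bracketed: Array[String] =
      df.select(columns.head, columns.tail: _*).collect().map(_.toString)

    // Row.mkString joins the fields directly, so no brackets appear: "a,b,c".
    val plain: Array[String] =
      df.select(columns.head, columns.tail: _*).collect().map(_.mkString(","))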

4 Answers


This should do the trick:

df.select(columns: _*).collect.map(_.toSeq)

1 Comment

How can we fix columns generating a "Cannot resolve symbol" compilation problem?
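
The likely cause of that error is that columns was never defined in the question; it is not a Spark built-in. A sketch of two common definitions (both assumptions, since the original is not shown):

    import org.apache.spark.sql.functions.col

    val columnNames: Array[String] = df.columns    // all column names as strings
    val columnExprs = columnNames.map(col)         // the same names as Column objects

    df.select(columnNames.head, columnNames.tail: _*)  // String varargs overload
    df.select(columnExprs: _*)                         // Column varargs overload, as used in this answer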

DataFrame to Array[String]

data.collect.map(_.toSeq).flatten

You can also use the following:

data.collect.map(row => row.getString(0))

If you have more data, it is better to use the last one, since the mapping then runs on the executors before the collect:

data.rdd.map(row => row.getString(0)).collect
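
A self-contained usage sketch of the getString variant; the SparkSession setup and the column name here are illustrative, not from the answer:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("toArrayString").master("local[*]").getOrCreate()
    import spark.implicits._

    val data = Seq("alice", "bob", "carol").toDF("name")

    // Mapping on the RDD extracts the strings on the executors; only the
    // resulting Array[String] is collected to the driver.
    val names: Array[String] = data.rdd.map(_.getString(0)).collect()
    // names: Array(alice, bob, carol)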

1 Comment

You can replace .map(_.toSeq).flatten with .flatMap(_.toSeq)

If you are planning to read the dataset line by line, you can use an iterator over the dataset:

    Dataset<Row> csv = session.read().format("csv")
        .option("sep", ",").option("inferSchema", true)
        .option("escape", "\"").option("header", true)
        .option("multiline", true).load("users/abc/...");

    for (Iterator<Row> iter = csv.toLocalIterator(); iter.hasNext(); ) {
        String[] item = iter.next().toString().split(",");
    }
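
The same pattern in Scala, to match the language of the question (a sketch; toLocalIterator returns a java.util.Iterator[Row], hence the converter):

    import scala.collection.JavaConverters._

    // Assumption: csv is the DataFrame loaded above. toLocalIterator pulls rows
    // to the driver one partition at a time instead of collecting all at once.
    val lines: Iterator[String] = csv.toLocalIterator().asScala.map(_.mkString(","))
    lines.foreach(println)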



The answer was provided by a user named cricket_007. You can use the following to convert the Array[Row] to an Array[String]:

val x = df.select(columns.head, columns.tail: _*).collect().map { row => row.toString() }

Thanks, Bharath

3 Comments

  • collect() on a DataFrame is often not how you would use it. Rather, you show() it.
  • Hello cricket_007, I don't think show is useful in this case when you want to assign it to a variable.
  • Not my point... Collecting the DataFrame or RDD to a Scala datatype becomes a bottleneck on the driver process. If you just want to display the output, you select, then show it without a collection.
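
For display only, the pattern the commenter describes is a one-liner (a sketch using the df and columns from the question):

    // show() prints up to 20 rows without materializing the whole result on
    // the driver; `false` disables truncation of long values.
    df.select(columns.head, columns.tail: _*).show(20, false)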
