How can I arrange the rows and the columns in Spark using Scala [duplicate]

Question

I have a text file in the following format:

first line
column1;column2;column3
column1;column2;column3
last line

I want to convert it into a DataFrame without the first and the last line. I have skipped the first and the last line, but then I get the rest of the text in one row and one column. How can I arrange the rows? I also have a schema for my DataFrame.

var textFile = sc.textFile("*.txt")
val header = textFile.first()
val total = textFile.count()
var rows = textFile.zipWithIndex().filter(x => x._2 < total - 1).map(x => x._1).filter(x => x !=  header)

val schema = StructType(Array(
  StructField("col1", IntegerType, true),
  StructField("col2", StringType, true),
  StructField("col3", StringType, true),
  StructField("col4", StringType, true)
))

You should split the rest of the text on ; and then convert each line to a Row and apply the schema to create the DataFrame.
– Anahcolus, Commented Apr 19, 2018 at 9:46
Yes, I have done that: import spark.implicits._ val rowss = rows.map(x => {val m = x.split(","); Row(m(0), m(1), m(2), m(3))}) val df = rowss.toDF().show() but toDF() is not working.
– Malo, Commented Apr 19, 2018 at 10:02
It is still not working with ; either. I also tried spark.createDataFrame(rowRDD, schema), but I get a lot of errors.
– Malo, Commented Apr 19, 2018 at 10:30
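
On the toDF() attempt in the comments above: toDF() needs an implicit encoder for the element type, and spark.implicits._ supplies encoders for tuples and case classes, not for Row. A minimal sketch of the tuple route follows, assuming a local SparkSession, made-up sample lines, and three columns where the first holds an integer (all of these are illustrative assumptions, not part of the original question).

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("todf-sketch").getOrCreate()
import spark.implicits._

// Hypothetical data lines, already stripped of the first and the last line.
val dataLines = spark.sparkContext.parallelize(Seq("1;a;b", "2;c;d"))

// Map to tuples (not Row) so that spark.implicits can supply an encoder for toDF.
val tuples = dataLines.map { line =>
  val f = line.split(";", -1)   // limit -1 keeps trailing empty fields
  (f(0).trim.toInt, f(1), f(2))
}

val df = tuples.toDF("col1", "col2", "col3")
// show() returns Unit, so call it separately instead of assigning its result to df.
df.show(false)
```

With an RDD[Row], spark.createDataFrame(rows, schema) is the call to use instead, and the values in each Row have to match the types declared in the schema.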

Anahcolus · Accepted Answer · 2018-04-19 11:09:28Z

You should be doing the following (commented for clarity):

// create the schema
import org.apache.spark.sql.types._
val schema = StructType(Array(
  StructField("col1", StringType, true),
  StructField("col2", StringType, true),
  StructField("col3", StringType, true)
))

// read the text file and count the total number of lines
val textFile = sc.textFile("*.txt")
val total = textFile.count()

// index the lines, drop the first (index 0) and the last (index total - 1),
// then split each remaining line on ";" and convert it to a Row
import org.apache.spark.sql.Row
val rows = textFile.zipWithIndex()
  .filter(x => x._2 != 0 && x._2 < total - 1)
  .map(x => Row.fromSeq(x._1.split(";").toSeq))

// finally create the DataFrame
// (on Spark 2.x and later, spark.createDataFrame(rows, schema) is the equivalent call)
val df = sqlContext.createDataFrame(rows, schema)
df.show(false)

which should give you

+-------+-------+-------+
|col1   |col2   |col3   |
+-------+-------+-------+
|column1|column2|column3|
|column1|column2|column3|
+-------+-------+-------+
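
The schema in the question declares col1 as IntegerType, while the answer uses StringType for every column. If the first column should be an integer, the split strings have to be converted before the Row is built, because a Row of plain strings would not match an IntegerType field. The following is a rough sketch under stated assumptions: a local SparkSession, made-up sample lines in place of the text file, and exactly three fields per data line with a numeric first field.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._

// Assumption: a local session for illustration; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("trim-first-last").getOrCreate()

// Hypothetical sample data standing in for the text file.
val lines = spark.sparkContext.parallelize(Seq(
  "first line",
  "1;a;b",
  "2;c;d",
  "last line"
))

// Three columns, with col1 typed as an integer as in the question's schema.
val schema = StructType(Array(
  StructField("col1", IntegerType, true),
  StructField("col2", StringType, true),
  StructField("col3", StringType, true)
))

val total = lines.count()

val rows = lines.zipWithIndex()
  // drop the first line (index 0) and the last line (index total - 1)
  .filter { case (_, idx) => idx != 0 && idx < total - 1 }
  .map { case (line, _) =>
    val f = line.split(";", -1)          // limit -1 keeps trailing empty fields
    Row(f(0).trim.toInt, f(1), f(2))     // convert col1 so it matches IntegerType
  }

val df = spark.createDataFrame(rows, schema)
df.printSchema()
df.show(false)
```

This sketch assumes every data line is well formed. A line with a non-numeric first field or fewer than three fields would throw when the job runs, so real input may need validation before the conversion.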

1 Comment

Malo Over a year ago

It's working, thank you!
