
I have some Spark code that processes a CSV file and applies some transformations to it. I now want to save this RDD as a CSV file and add a header. Each line of the RDD is already formatted correctly.

I am not sure how to do it. I tried to union the header string with my RDD, but the header string is not an RDD, so it does not work.

5 Answers


You can make an RDD out of your header line and then union it, yes:

val rdd: RDD[String] = ...
val header: RDD[String] = sc.parallelize(Array("my,header,row"))
header.union(rdd).saveAsTextFile(...)

Then you end up with a bunch of part-xxxxx files that you merge.

The problem is that I don't think you're guaranteed that the header will be the first partition and therefore end up in part-00000 and at the top of your file. In practice, I'm pretty sure it will.
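If you do need a guarantee, one workaround (a sketch, not part of this answer) is to prepend the header inside the first partition with mapPartitionsWithIndex, so no cross-partition ordering is involved:

import org.apache.spark.rdd.RDD

// Prepend the header to partition 0 so it is always written at the top of part-00000.
def withHeader(rdd: RDD[String], header: String): RDD[String] =
  rdd.mapPartitionsWithIndex { (idx, iter) =>
    if (idx == 0) Iterator(header) ++ iter else iter
  }

withHeader(rdd, "my,header,row").saveAsTextFile("/data/output") // path is hypothetical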

More reliable still would be to use Hadoop commands like hdfs dfs -getmerge to combine the part-xxxxx files, and as part of that step, throw in the header line from a file.
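A minimal sketch of that merge using the Hadoop FileSystem API (the paths /data/output and /data/final.csv are hypothetical):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.IOUtils

val conf = new Configuration()
val fs = FileSystem.get(conf)
val out = fs.create(new Path("/data/final.csv"))

// Write the header first.
out.write("my,header,row\n".getBytes("UTF-8"))

// Append every part-* file in name order, keeping `out` open between copies.
val parts = fs.listStatus(new Path("/data/output"))
  .map(_.getPath)
  .filter(_.getName.startsWith("part-"))
  .sortBy(_.getName)

for (part <- parts) {
  val in = fs.open(part)
  IOUtils.copyBytes(in, out, conf, false)
  in.close()
}
out.close()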


2 Comments

In Spark 1.6.2 running in distributed mode, union did not put the header on top for me. Here is my snippet: val header = sc.parallelize(Array("col1", "col2"), 1); header.union(rdd.map(_.toString)).repartition(1).saveAsTextFile(outputLocation)
Same issue for me... union() is not guaranteed to preserve order. Looking for a workaround now; it looks like sorting the RDD might help.

Some help on writing it without union (the header is supplied at the time of the merge):

import java.io.{ByteArrayInputStream, DataInputStream, InputStream}
import java.nio.charset.StandardCharsets
import org.apache.hadoop.io.IOUtils
val fileHeader = "This is header"
val fileHeaderStream: InputStream = new ByteArrayInputStream(fileHeader.getBytes(StandardCharsets.UTF_8))
IOUtils.copyBytes(fileHeaderStream, out, conf, false) // write the header first; out is the merged file's output stream

Now loop over your file parts to write the complete file, using:

val in: DataInputStream = ... // <data input stream from each part file>
IOUtils.copyBytes(in, out, conf, false) // keep `out` open between parts (close = false)

This made sure for me that the header always comes as the first line, even when you use coalesce/repartition for efficient writing.


import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

def addHeaderToRdd(sparkCtx: SparkContext, lines: RDD[String], header: String): RDD[String] = {
  // Index the header with -1L so that sorting by key puts it on top;
  // zipWithIndex assigns 0-based indices to the data lines.
  val headerRDD = sparkCtx.parallelize(List((-1L, header)))
  val pairRDD = lines.zipWithIndex().map { case (line, index) => (index, line) }
  pairRDD.union(headerRDD).sortByKey().values
}
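Example usage (a sketch; sc and lines are assumed to be in scope, and the header string is hypothetical):

val withHeader = addHeaderToRdd(sc, lines, "col1,col2,col3")
withHeader.saveAsTextFile("/data/output") // the header sorts first, so it tops part-00000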



A slightly different approach, using Spark SQL

From the question: "I now want to save this RDD as a CSV file and add a header. Each line of this RDD is already formatted correctly."

With Spark 2.x you have several options to convert an RDD to a DataFrame:

val rdd = .... // assume an RDD properly formatted with a case class or tuple
val df = spark.createDataFrame(rdd).toDF("col1", "col2", ... "coln")

df.write
  .format("csv")
  .option("header", "true") // adds the header to the file
  .save("hdfs://location/to/save/csv")

Now we can use the Spark SQL DataFrame API to load, transform, and save the CSV file.
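For example, reading the saved file back with the header row interpreted as column names (path assumed from above):

val dfIn = spark.read
  .format("csv")
  .option("header", "true")
  .load("hdfs://location/to/save/csv")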


import org.apache.spark.sql.Row

spark.sparkContext
  .parallelize(Seq(SqlHelper.getARow(temRet.columns, temRet.columns.length)))
  .union(temRet.rdd)
  .map(x => x.mkString("\u0001")) // "\u0001" as the field separator ("\x01" is not valid Scala)
  .coalesce(1, shuffle = true)
  .saveAsTextFile(retPath)

object SqlHelper {
  // Build a single Row from the column names, to serve as the header row.
  def getARow(x: Array[String], size: Int): Row = {
    val columnArray = new Array[String](size)
    for (i <- 0 until size) {
      columnArray(i) = x(i)
    }
    Row.fromSeq(columnArray)
  }
}

1 Comment

Can someone write this in Java?
