I am new to Scala and am trying to write code that first reads the list of files in a folder and then loads each of those CSV files into HDFS.
Right now I iterate through all the CSV files with a for loop, but I want to do this with multithreading so that each thread handles one file and performs the end-to-end processing for that file.
My current implementation is:
import java.io.File

val fileArray: Array[File] = new java.io.File(source).listFiles.filter(_.getName.endsWith(".csv"))

for (file <- fileArray) {
  // read the CSV file from the shared location into a DataFrame
  val df = loadCSV2DF(sqlContext, fileFormat, "true", "true", file.getName)

  // destination location for this file in HDFS
  val finalDestination: String = destination + file.getName

  // save the data into HDFS, using the default number of partitions (1)
  writeDF2HDFS(df, fileFormat, "true", finalDestination)
}
I have tried looking into Scala's Future API, but so far I have not been able to understand its usage properly.
Any pointers on how Scala's Future API could help me here would be greatly appreciated.
Regards, Bhupesh
What do loadCSV2DF and writeDF2HDFS return? With Futures the basic approach would look like fileArray.map(file => loadCSV2DF(...).flatMap(df => writeDF2HDFS(...))), but loadCSV2DF and writeDF2HDFS would have to return futures in that case.
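Building on that comment, here is a minimal sketch of the Future-based approach, assuming loadCSV2DF and writeDF2HDFS keep their current synchronous signatures and are simply wrapped in Future { ... }. It uses the default global execution context, which you may want to swap for a fixed-size thread pool sized to your cluster:

import scala.concurrent.{Await, Future}
import scala.concurrent.duration.Duration
import scala.concurrent.ExecutionContext.Implicits.global

// one Future per CSV file: each future loads the file and writes it to HDFS
val tasks = fileArray.toSeq.map { file =>
  Future {
    val df = loadCSV2DF(sqlContext, fileFormat, "true", "true", file.getName)
    writeDF2HDFS(df, fileFormat, "true", destination + file.getName)
  }
}

// block until every file has been processed (prefer a finite timeout in real code)
Await.result(Future.sequence(tasks), Duration.Inf)

Note that all of these futures share the same SparkContext; Spark's scheduler accepts jobs submitted from multiple threads, so the effective parallelism is bounded by the resources available to your application rather than by the number of futures you create.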