I have a file which is comma separated. Let's assume I have an Accounts file with the following data:
AcctId, AcctName, City, State, Deductible
1,ABC,Fremont,CA,4000
1,DEF,UnionCity,CA,10000
2,FFF,Hayward,CA,2323
I want a Dataset or a list of (AcctId, Count) pairs, like:
1,2
2,1
I have the following code:
import sqlContext.implicits._ // needed for the Encoder used by groupByKey

val df: DataFrame = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")      // use first line of all files as header (option values are strings)
  .option("delimiter", ",")
  .option("inferSchema", "true") // automatically infer data types
  .load(file)

// Group on the AcctId column; grouping on the whole row keys by every column, not just AcctId
val accGrpCountsDs = df.groupByKey(row => row.getAs[Int]("AcctId")).count()
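For what it's worth, the untyped DataFrame aggregation below seems to produce the same result (a sketch; groupBy on the column name avoids the typed Encoder machinery entirely):

val accGrpCountsDf = df.groupBy("AcctId").count()
accGrpCountsDf.show()
// +------+-----+
// |AcctId|count|
// +------+-----+
// |     1|    2|
// |     2|    1|
// +------+-----+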
I am doing this operation in a loop over 8 files, and I am updating the counts in a concurrent map, since the same AcctId can appear in all 8 files; the count in the map is a cumulative sum. The 8 files are expected to have millions of rows in total.
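Rather than looping, I was wondering whether reading all 8 files into a single DataFrame (a sketch, assuming they share the same schema; the file names here are hypothetical) would let Spark compute the cumulative counts itself and make the concurrent map unnecessary:

val files = Seq("accounts1.csv", "accounts2.csv" /* ... through accounts8.csv */)

// Read every file with the same options and union them into one DataFrame
val all = files
  .map(f => sqlContext.read
    .format("com.databricks.spark.csv")
    .option("header", "true")
    .option("delimiter", ",")
    .option("inferSchema", "true")
    .load(f))
  .reduce(_ unionAll _)

// A single aggregation over the union replaces the per-file loop + concurrent map
val totalCounts = all.groupBy("AcctId").count()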
I have these questions:
What's the best way to achieve this? Is groupByKey better, or reduceByKey? Should I use an RDD or a DataFrame? (The RDD version I am weighing is sketched below.)
Can you please share examples?
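For reference, this is the RDD reduceByKey alternative I am comparing against (a sketch that parses the CSV manually, so it assumes no quoted fields containing commas):

val counts = sc.textFile(file)
  .mapPartitionsWithIndex((idx, it) => if (idx == 0) it.drop(1) else it) // skip the header line
  .map(line => (line.split(",")(0), 1L))  // key = AcctId, value = 1 per row
  .reduceByKey(_ + _)                     // sums counts, combining map-side before the shuffle

counts.collect().foreach(println)         // prints (1,2) and (2,1) for the sample data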
Thanks