
I have a comma-separated file. Let's assume I have an Accounts file with the following data:

AcctId, AcctName, City, State, Deductible
1,ABC,Fremont,CA,4000
1,DEF,UnionCity,CA,10000
2,FFF, Hayward,CA,2323

I want a dataset or a list of (AcctId, Count) pairs, like this:
1,2
2,1

I have the following code:

import sqlContext.implicits._  // needed for the encoders used by groupByKey

val df: DataFrame = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")       // use the first line of each file as the header
  .option("delimiter", ",")
  .option("inferSchema", "true")  // automatically infer column types
  .load(file)

// group rows by the AcctId column and count them per account
val accGrpCountsDs = df.groupByKey(row => row.getAs[Int]("AcctId")).count()

I am doing this operation in a loop over 8 files and updating the counts in a concurrent map, since the same AcctId appears in all 8 files; the count in the map is a cumulative sum. The 8 files are expected to have millions of rows.
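Roughly, my loop looks like this (a simplified sketch: `files` and the `totals` map are placeholder names, and `sqlContext.implicits._` is assumed to be imported as in the snippet above):

import scala.collection.concurrent.TrieMap

val totals = TrieMap.empty[Int, Long]

files.foreach { file =>
  val df = sqlContext.read
    .format("com.databricks.spark.csv")
    .option("header", "true")
    .option("delimiter", ",")
    .option("inferSchema", "true")
    .load(file)

  // per-file counts keyed by AcctId
  val counts = df.groupByKey(row => row.getAs[Int]("AcctId")).count()

  // add this file's counts to the running totals
  counts.collect().foreach { case (acctId, cnt) =>
    totals(acctId) = totals.getOrElse(acctId, 0L) + cnt
  }
}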

I have these questions:

What's the best way to achieve this? Is groupByKey better than reduceByKey? Should I use RDDs or DataFrames?

Can you please share examples?

Thanks

1 Answer


Just use df.groupBy("AcctId").count(). This way you avoid deserializing rows out of Tungsten's binary format, and you get a DataFrame as output.

By the way, consider reading the whole directory at once instead of reading the CSV files one by one.
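A minimal sketch of both ideas together (the directory path is a placeholder; the reader options are taken from your snippet):

// Point the reader at the directory so all 8 CSV files are read in one pass
// ("/path/to/accounts" is a placeholder).
val all = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("delimiter", ",")
  .option("inferSchema", "true")
  .load("/path/to/accounts")

// One row per AcctId with its cumulative count across all files.
// Everything stays in the DataFrame API, so rows are never deserialized
// from Tungsten's binary representation.
val countsPerAcct = all.groupBy("AcctId").count()
countsPerAcct.show()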


2 Comments

Thanks, I am using groupBy, but do you know whether groupBy or reduceByKey is faster? What do you suggest?
groupBy() is a DataFrame method (not groupByKey()!). It can be fast because it operates on Tungsten's binary format, but it can also be slow (e.g. if you have many rows per key and want to apply advanced aggregation functions). It depends on your actual problem.
