
I have a comma-separated file. Let's assume I have an Accounts file with the following data:

AcctId, AcctName, City, State, Deductible
1,ABC,Fremont,CA,4000
1,DEF,UnionCity,CA,10000
2,FFF, Hayward,CA,2323

I want a dataset or a list of (AcctId, Count) pairs, like this:
1,2
2,1

I have the following code:

import sqlContext.implicits._  // needed for the encoders used by groupByKey

val df: DataFrame = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")       // use the first line of each file as the header
  .option("delimiter", ",")
  .option("inferSchema", "true")  // automatically infer column types
  .load(file)

// group rows by the AcctId column and count them per account
val accGrpCountsDs = df.groupByKey(row => row.getAs[Int]("AcctId")).count()

I am doing this operation in a loop over 8 files and updating the counts in a concurrent map, since the same AcctId appears in all 8 files; the count in the map is a cumulative sum. The 8 files are expected to have millions of rows.
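Roughly, my loop looks like this (a simplified sketch: `files` and the `totals` map are placeholder names, and `sqlContext.implicits._` is assumed to be imported as in the snippet above):

import scala.collection.concurrent.TrieMap

val totals = TrieMap.empty[Int, Long]

files.foreach { file =>
  val df = sqlContext.read
    .format("com.databricks.spark.csv")
    .option("header", "true")
    .option("delimiter", ",")
    .option("inferSchema", "true")
    .load(file)

  // per-file counts keyed by AcctId
  val counts = df.groupByKey(row => row.getAs[Int]("AcctId")).count()

  // add this file's counts to the running totals
  counts.collect().foreach { case (acctId, cnt) =>
    totals(acctId) = totals.getOrElse(acctId, 0L) + cnt
  }
}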

I have these questions:

What's the best way to achieve this? Is groupByKey better than reduceByKey? Should I use RDDs or DataFrames?

Can you please share examples?

Thanks

1 Answer


Just use df.groupBy("AcctId").count(). This way you avoid deserializing rows out of Tungsten's binary format, and you get a DataFrame as output.

By the way, consider reading the whole directory at once instead of reading the CSV files one by one.
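A minimal sketch of both ideas together (the directory path is a placeholder; the reader options are taken from your snippet):

// Point the reader at the directory so all 8 CSV files are read in one pass
// ("/path/to/accounts" is a placeholder).
val all = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("delimiter", ",")
  .option("inferSchema", "true")
  .load("/path/to/accounts")

// One row per AcctId with its cumulative count across all files.
// Everything stays in the DataFrame API, so rows are never deserialized
// from Tungsten's binary representation.
val countsPerAcct = all.groupBy("AcctId").count()
countsPerAcct.show()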


2 Comments

Thanks, I am using groupBy, but do you know whether groupBy or reduceByKey is faster? What do you suggest?
groupBy() is a DataFrame method (not groupByKey()!). It can be fast because it operates on Tungsten's binary format, but it can also be slow (e.g. if you have many rows per key and want to apply advanced aggregation functions). It depends on your actual problem.
