
I am trying to divide a data frame into n groups based on certain values of its columns, and ended up with the code below. But it doesn't look efficient in terms of nested for loops. I am looking for a more elegant approach to implementing the following. Can someone please provide inputs?

Input will be the column names based on which the data frame should be divided. I have a val storing the distinct values of those columns. It stores them like:

(0)(0) = F
(0)(1) = M
(1)(0) = drugY
(1)(1) = drugC
(1)(2) = drugX

So I have a total of 6 sub-data frames created with column values as follows (a sketch of the splitting loop follows the list):

F and drugY
M and drugY
F and drugC
M and drugC
F and drugX
M and drugX
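
For concreteness, here is a minimal sketch of the nested-loop splitting described above; df and the column names are assumptions, not the original code:

// Minimal sketch, assuming an input DataFrame `df` and the two split
// columns from the example above (all names here are assumptions).
val cols = Seq("sex", "drug")

// Distinct values per column, e.g. vals(0) = Seq(F, M), vals(1) = Seq(drugY, drugC, drugX).
val vals = cols.map(c => df.select(c).distinct().collect().map(_.get(0)).toSeq)

// One sub-DataFrame per combination of distinct values (the nested loops).
val subFrames = for {
  v0 <- vals(0)
  v1 <- vals(1)
} yield df.filter(df(cols(0)) === v0 && df(cols(1)) === v1)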
  • When you are doing these things with a DataFrame, you do not need to worry about the efficiency of for loops. Spark tip 1: almost all operations on any DataFrame are very, very expensive relative to the efficiency of a for loop. Commented Nov 28, 2016 at 22:07

1 Answer


I don't really understand what you want to do, but if you want to generate the combinations using the Spark DataFrame API, you can do it like this:

import spark.implicits._   // needed for .toDF outside spark-shell / notebooks, assuming a SparkSession named spark

val patients = Seq(
    (1, "f"),
    (2, "m")
).toDF("id", "name")

val drugs = Seq(
    (1, "drugY"),
    (2, "drugC"),
    (3, "drugX")
).toDF("id", "name")

patients.createOrReplaceTempView("patients")
drugs.createOrReplaceTempView("drugs")

val c = sqlContext.sql("select p.id as patient_id, p.name as patient_name, d.id as drug_id, d.name as drug_name from patients p cross join drugs d")

c.show



+----------+------------+-------+---------+
|patient_id|patient_name|drug_id|drug_name|
+----------+------------+-------+---------+
|         1|           f|      1|    drugY|
|         1|           f|      2|    drugC|
|         1|           f|      3|    drugX|
|         2|           m|      1|    drugY|
|         2|           m|      2|    drugC|
|         2|           m|      3|    drugX|
+----------+------------+-------+---------+

or with the DataFrame API:

val cartesian = patients.join(drugs)

cartesian.show
+---+----+---+-----+
| id|name| id| name|
+---+----+---+-----+
|  1|   f|  1|drugY|
|  1|   f|  2|drugC|
|  1|   f|  3|drugX|
|  2|   m|  1|drugY|
|  2|   m|  2|drugC|
|  2|   m|  3|drugX|
+---+----+---+-----+
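
Note that patients.join(drugs) with no join condition relies on the planner producing a cartesian product; in Spark 2.x an unconditioned join like this is rejected unless spark.sql.crossJoin.enabled is set, so the explicit form (available since Spark 2.1) is the safer spelling of the same thing:

patients.crossJoin(drugs)   // same cartesian product as the unconditioned join above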

After that you can use crosstab on the named-column result c to get a table of the frequency distribution:

c.stat.crosstab("patient_name", "drug_name").show

+----------------------+-----+-----+-----+
|patient_name_drug_name|drugC|drugX|drugY|
+----------------------+-----+-----+-----+
|                     m|    1|    1|    1|
|                     f|    1|    1|    1|
+----------------------+-----+-----+-----+

3 Comments

Thanks for the update, but my requirement is on a single data frame, which I have to divide into n sub-data frames based on certain columns. In your example, suppose the columns patient_id, patient_name, drug_name are passed in as input. First, I filter the df based on patient_id, so I have 2 dfs: df1 with patient_id = 1 and df2 with patient_id = 2. The second column is patient_name, so I filter df1 and df2 on matching values of patient_name. That gives me 4 dfs: patient_id = 1 / patient_name = f, patient_id = 1 / patient_name = m, patient_id = 2 / patient_name = f, patient_id = 2 / patient_name = m.
Then I filter each of those data frames on matching values of drug_name, which gives 12 data frames: 1) patient_id = 1, patient_name = f, drug_name = drugY, and so on. Once all these sub-data frames are created based on the given input condition, I pick a few random samples from each data frame; that is what I'm trying to achieve with the above code. But I ended up with some for loops, which I think is not optimal, so I'm looking for suggestions on how this can be achieved. (A sketch of this flow follows these comments.)
Why do you want many data frames? What do you do with the data frames?
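
Following up on the requirement described in those comments, here is a hedged sketch of the split-and-sample flow on a single data frame; the names df and splitCols and the 0.1 sampling fraction are assumptions, not code from the original post:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

val splitCols = Seq("patient_name", "drug_name")   // assumed input columns

// Collect the distinct combinations once instead of nesting loops per column.
val combos = df.select(splitCols.map(col): _*).distinct().collect()

// One filtered sub-DataFrame per combination, each sampled independently.
val samples: Seq[DataFrame] = combos.toSeq.map { row =>
  val cond = splitCols.zipWithIndex
    .map { case (c, i) => col(c) === row.get(i) }
    .reduce(_ && _)
  df.filter(cond).sample(withReplacement = false, fraction = 0.1)   // assumed fraction
}

This avoids hard-coding one loop per column: the combination rows drive the filtering, so the same code works for any number of split columns.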
