
I am trying to divide a data frame into n groups based on certain values of its columns, and ended up with the code below. But it doesn't look efficient in terms of nested for loops. I am looking for a more elegant approach to implementing the following. Can someone please provide inputs?

Input will be the column names based on which the data frame should be divided. I have a val storing the distinct values of those columns. It stores them like:

(0)(0) = F
(0)(1) = M
(1)(0) = drugY
(1)(1) = drugC
(1)(2) = drugX

So I have a total of 6 sub-data frames created with column values as follows (a sketch of the splitting loop follows the list):

F and drugY
M and drugY
F and drugC
M and drugC
F and drugX
M and drugX
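
For concreteness, here is a minimal sketch of the nested-loop splitting described above; df and the column names are assumptions, not the original code:

// Minimal sketch, assuming an input DataFrame `df` and the two split
// columns from the example above (all names here are assumptions).
val cols = Seq("sex", "drug")

// Distinct values per column, e.g. vals(0) = Seq(F, M), vals(1) = Seq(drugY, drugC, drugX).
val vals = cols.map(c => df.select(c).distinct().collect().map(_.get(0)).toSeq)

// One sub-DataFrame per combination of distinct values (the nested loops).
val subFrames = for {
  v0 <- vals(0)
  v1 <- vals(1)
} yield df.filter(df(cols(0)) === v0 && df(cols(1)) === v1)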
  • When you are doing these things with a DataFrame, you do not need to worry about the efficiency of for loops. Spark tip 1: almost all operations on any DataFrame are very, very expensive relative to the efficiency of a for loop. Commented Nov 28, 2016 at 22:07

1 Answer


I don't really understand what you want to do, but if you want to generate the combinations using the Spark DataFrame API, you can do it like this:

import spark.implicits._   // needed for .toDF outside spark-shell / notebooks, assuming a SparkSession named spark

val patients = Seq(
    (1, "f"),
    (2, "m")
).toDF("id", "name")

val drugs = Seq(
    (1, "drugY"),
    (2, "drugC"),
    (3, "drugX")
).toDF("id", "name")

patients.createOrReplaceTempView("patients")
drugs.createOrReplaceTempView("drugs")

val c = sqlContext.sql("select p.id as patient_id, p.name as patient_name, d.id as drug_id, d.name as drug_name from patients p cross join drugs d")

c.show



+----------+------------+-------+---------+
|patient_id|patient_name|drug_id|drug_name|
+----------+------------+-------+---------+
|         1|           f|      1|    drugY|
|         1|           f|      2|    drugC|
|         1|           f|      3|    drugX|
|         2|           m|      1|    drugY|
|         2|           m|      2|    drugC|
|         2|           m|      3|    drugX|
+----------+------------+-------+---------+

or with the DataFrame API:

val cartesian = patients.join(drugs)

cartesian.show
+---+----+---+-----+
| id|name| id| name|
+---+----+---+-----+
|  1|   f|  1|drugY|
|  1|   f|  2|drugC|
|  1|   f|  3|drugX|
|  2|   m|  1|drugY|
|  2|   m|  2|drugC|
|  2|   m|  3|drugX|
+---+----+---+-----+
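
Note that patients.join(drugs) with no join condition relies on the planner producing a cartesian product; in Spark 2.x an unconditioned join like this is rejected unless spark.sql.crossJoin.enabled is set, so the explicit form (available since Spark 2.1) is the safer spelling of the same thing:

patients.crossJoin(drugs)   // same cartesian product as the unconditioned join above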

After that you can use crosstab on the named-column result c to get a table of the frequency distribution:

c.stat.crosstab("patient_name", "drug_name").show

+----------------------+-----+-----+-----+
|patient_name_drug_name|drugC|drugX|drugY|
+----------------------+-----+-----+-----+
|                     m|    1|    1|    1|
|                     f|    1|    1|    1|
+----------------------+-----+-----+-----+

3 Comments

Thanks for the update, but my requirement is on a single data frame, which I have to divide into n sub-data frames based on certain columns. In your example, suppose the columns patient_id, patient_name, drug_name are passed in as input. First, I filter the df based on patient_id, so I have 2 dfs: df1 with patient_id = 1 and df2 with patient_id = 2. The second column is patient_name, so I filter df1 and df2 on matching values of patient_name. That gives me 4 dfs: patient_id = 1 / patient_name = f, patient_id = 1 / patient_name = m, patient_id = 2 / patient_name = f, patient_id = 2 / patient_name = m.
Then I filter each of those data frames on matching values of drug_name, which gives 12 data frames: 1) patient_id = 1, patient_name = f, drug_name = drugY, and so on. Once all these sub-data frames are created based on the given input condition, I pick a few random samples from each data frame; that is what I'm trying to achieve with the above code. But I ended up with some for loops, which I think is not optimal, so I'm looking for suggestions on how this can be achieved. (A sketch of this flow follows these comments.)
Why do you want many data frames? What do you do with the data frames?
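
Following up on the requirement described in those comments, here is a hedged sketch of the split-and-sample flow on a single data frame; the names df and splitCols and the 0.1 sampling fraction are assumptions, not code from the original post:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

val splitCols = Seq("patient_name", "drug_name")   // assumed input columns

// Collect the distinct combinations once instead of nesting loops per column.
val combos = df.select(splitCols.map(col): _*).distinct().collect()

// One filtered sub-DataFrame per combination, each sampled independently.
val samples: Seq[DataFrame] = combos.toSeq.map { row =>
  val cond = splitCols.zipWithIndex
    .map { case (c, i) => col(c) === row.get(i) }
    .reduce(_ && _)
  df.filter(cond).sample(withReplacement = false, fraction = 0.1)   // assumed fraction
}

This avoids hard-coding one loop per column: the combination rows drive the filtering, so the same code works for any number of split columns.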
