
I have to call a function func_test(spark, a, b) which accepts two string values and creates a DataFrame from them. spark is a SparkSession variable. The two string values come from two columns of another DataFrame and are different for different rows of that DataFrame.

I am unable to achieve this.

Things tried so far:
1.

ctry_df = func_test(spark, df.select("CTRY").first()["CTRY"], df.select("CITY").first()["CITY"])

Gives CTRY and CITY of only the first record of the df.

2.

ctry_df = func_test(spark, df['CTRY'], df['CITY'])

Gives Column objects (Column<b'CTRY'> and Column<b'CITY'>) instead of the actual string values.

Example: df is:

+------+------+-------+
| CTRY | CITY | XYZ   |
+------+------+-------+
| US   | LA   | HELLO |
| UK   | LN   | WORLD |
| SN   | SN   | SPARK |
+------+------+-------+

So, I want the first call to be func_test(spark, 'US', 'LA'); the second call to be func_test(spark, 'UK', 'LN'); the third call to be func_test(spark, 'SN', 'SN'); and so on.

Python - 3.7
Spark - 2.2

Edit 1:

Issue in detail:

func_test(spark, string1, string2) is a function which accepts two string values. Inside the function, a series of DataFrame operations is performed. For example, the first Spark SQL statement in func_test is a plain select, and the two variables string1 and string2 are used in its where clause. The DataFrame produced by that statement becomes a temp table for the next Spark SQL statement, and so on. Finally, the function returns the DataFrame produced by the last step.

Now, in the main program, I have to call func_test, and the two parameters string1 and string2 must be fetched from the records of a DataFrame, so that the first func_test call generates the query select * from dummy where CTRY='US' and CITY='LA' and the subsequent operations happen, resulting in a df. The second call to func_test becomes select * from dummy where CTRY='UK' and CITY='LN', the third becomes select * from dummy where CTRY='SN' and CITY='SN', and so on.
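To make this concrete, here is a minimal sketch of what such a func_test could look like. The table name dummy and the first query are taken from the description above; the follow-up query is a placeholder assumption standing in for the real chain of operations:

def func_test(spark, string1, string2):
    # first Spark SQL: plain select with the two strings in the where clause
    step1_df = spark.sql(
        "select * from dummy where CTRY='{0}' and CITY='{1}'".format(string1, string2)
    )
    # the result feeds the next Spark SQL statement as a temp table
    step1_df.createOrReplaceTempView("step1")

    # placeholder for the subsequent operations; the real function
    # chains further queries the same way and returns the final df
    result_df = spark.sql("select CTRY, CITY, XYZ from step1")
    return result_df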

  • I didn't get your question exactly, but I think you want to define a new column as a function of two other column values. Am I right? Commented Nov 18, 2019 at 10:37
  • No, I don't want to create a new column. I want to create a new dataframe ctry_df which has no relation to df, but ctry_df must have columns which are the result of the operations performed by func_test(a,b). Commented Nov 18, 2019 at 10:40
  • Do you want to create a new dataframe with the same schema? Commented Nov 18, 2019 at 10:57
  • The schema of the new dataframe would be different. If it is unclear, I can try to explain the question in more detail. Commented Nov 18, 2019 at 10:58
  • Yes, I think it is complicated. Commented Nov 18, 2019 at 15:07

1 Answer


Instead of first(), use collect() and iterate over the returned rows:

collect_vals = df.select('CTRY', 'CITY').distinct().collect()
for row_col in collect_vals:
    # each call receives this row's CTRY and CITY as plain strings
    func_test(spark, row_col['CTRY'], row_col['CITY'])

Hope this helps!


7 Comments

Why do we need distinct here? The combination of CTRY and CITY could also be repeated in some other row of the df.
Removed distinct, and now it works close to what is expected. Thanks.
I assumed you need each combination only once, but if you don't want to de-duplicate, then distinct is not required.
One more input needed in continuation to this: on the first call with a CTRY and CITY, a df (one row) is returned. On the next call, for the next CTRY and CITY, the next df (one more row) is returned. Is there a way to stack all these dfs one over the other and create a final df (or a temp table) with as many rows at the end?
You can simply join with the DataFrame from which you are returning the filtered values, on the same keys, country and city (see the sketch below).
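Alternatively, if every call to func_test returns a DataFrame with the same schema, the per-call results can be stacked with union(). A minimal sketch assuming identical schemas (unionByName requires Spark 2.3+, so plain union is used here for Spark 2.2):

from functools import reduce

# collect the driving (CTRY, CITY) pairs and call func_test once per pair
rows = df.select('CTRY', 'CITY').distinct().collect()
result_dfs = [func_test(spark, r['CTRY'], r['CITY']) for r in rows]

# stack all per-call DataFrames row-wise; requires identical schemas
final_df = reduce(lambda a, b: a.union(b), result_dfs)
final_df.createOrReplaceTempView("final_result")  # optional: expose as a temp table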
