0

I have an initial dataframe df that looks like this:

+-------+---+-----+------------------+----+-------------------+
|gender| pro|share|        prediction|week|     forecast_units|
+------+----+-----+------------------+----+-------------------+
|  Male|Polo| 0.01| 258.4054260253906|  37|             1809.0|
|  Male|Polo|  0.1| 332.4026794433594|  38|             2327.0|
|  Male|Polo| 0.15|425.97430419921875|  39|             2982.0|
|  Male|Polo|  0.2| 508.3385314941406|  40|             3558.0|
....

I have the following code that attempts to create multiple dataframes from the original dataframe by applying some calculus. Initial I create four empty dataframes and then I want to loop through four different weeks, c_weeks, and save the result from the calculus to each dataframe on the list_dfs:

schema = StructType([\
    StructField("gender", StringType(),True), \
    StructField("pro",StringType(),True), \
    StructField("units_1_tpr",DoubleType(),True), \
    StructField("units_1'_tpr",DoubleType(),True), \
    StructField("units_15_tpr",DoubleType(),True), \
    StructField("units_20_tpr",DoubleType(),True)])

df_wk1 = spark.createDataFrame([],schema=schema)
df_wk2 = spark.createDataFrame([],schema=schema)
df_wk3 = spark.createDataFrame([],schema=schema)
df_wk4 = spark.createDataFrame([],schema=schema)

list_dfs = [df_wk1, df_wk2, df_wk3, df_wk4]
c_weeks = [37, 38, 39, 40]

for data,weeknum in zip(list_dfs, campaign_weeks):
    data = df.filter(df.week == weeknum).groupBy(['gender', 'pro']).pivot("share").agg(first('forecast_units'))

In the end, the dataframes continue empty. How do fix this? If this way is not possible how can I implement what I want?

1 Answer 1

1

If you assign the result of df.filter(... to data it will be lost (actually, that line has no effect). Try this way:

df_wk1, df_wk2, df_wk3, df_wk4 = [
    df.filter(df.week == weeknum).groupBy(['gender', 'pro']).pivot("share").agg(first('forecast_units'))
    for weeknum in [37, 38, 39, 40]
]

However, df.filter(df.week == weeknum).groupBy(['gender', 'pro']).pivot("share").agg(first('forecast_units')) create a DataFrame with a different schema from the one you probably want (looking at your question).

This is an example of the DataFrame you get:

+------+----+------+
|gender| pro|   0.0|
+------+----+------+
|  Male|Polo|3558.0|
+------+----+------+

and this is its schema:

root
 |-- gender: string (nullable = true)
 |-- pro: string (nullable = true)
 |-- 0.0: double (nullable = true)
Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.