I am new to Spark, and I have a DataFrame like the following:
+-------------------+----------+--------------------+----------------+--------------+
| placekey|naics_code| visits_by_day|date_range_start|date_range_end|
+-------------------+----------+--------------------+----------------+--------------+
|22b-223@627-wdh-fcq| 311811|[22,16,22,32,44,1...| 2018-12-31| 2019-01-07|
|22b-222@627-wc3-99f| 311811| [2,4,3,3,4,6,5]| 2019-01-28| 2019-02-04|
|222-222@627-w9g-rrk| 311811| [3,3,5,5,6,2,5]| 2019-02-04| 2019-02-11|
+-------------------+----------+--------------------+----------------+--------------+
I want to create another column, date_bet_dates, that holds the list of dates between date_range_start and date_range_end. This is the code I have so far:
from datetime import datetime, timedelta
import pandas as pd
from pyspark.sql.functions import udf
from pyspark.sql.types import DateType

def get_dates(s, e):
    start = datetime.strptime(s, '%Y-%m-%d').date()
    end = datetime.strptime(e, '%Y-%m-%d').date()
    return pd.date_range(start, end - timedelta(days=1), freq='d')

udf_get_dates = udf(lambda x: get_dates(x), DateType())
df = df.withColumn('date_bet_dates', udf_get_dates(df['date_range_start'], df['date_range_end']))
df.show(3)
An error occurs at the line df.show(3):
TypeError: <lambda>() takes 1 positional argument but 2 were given
I have no idea which arguments it is referring to, but I assume it has something to do with my get_dates function. What needs to be changed to solve this problem?
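For context, here is a minimal sketch of one direction a fix could take (this is my own assumption about the intent, not a confirmed answer): the lambda only accepts one argument while the UDF is invoked with two columns, and the declared return type would need to be an array of dates rather than a single DateType, since the function returns a sequence. The date generation itself can be done with plain datetime arithmetic instead of pandas:

```python
from datetime import datetime, timedelta

def get_dates(s, e):
    """Return the dates from s up to (but not including) e, one per day."""
    start = datetime.strptime(s, '%Y-%m-%d').date()
    end = datetime.strptime(e, '%Y-%m-%d').date()
    # plain list of datetime.date objects, matching pd.date_range(start, end - 1 day)
    return [start + timedelta(days=i) for i in range((end - start).days)]

# Spark wiring (sketch, assuming a SparkSession and the df above exist):
# from pyspark.sql.functions import udf
# from pyspark.sql.types import ArrayType, DateType
# udf_get_dates = udf(get_dates, ArrayType(DateType()))  # two-arg function, array return type
# df = df.withColumn('date_bet_dates',
#                    udf_get_dates(df['date_range_start'], df['date_range_end']))
```

Passing get_dates directly to udf (instead of a one-argument lambda) lets Spark forward both columns to it, which is what the TypeError is complaining about.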