Adding a column to a dataframe using other columns

Question

I am new to Spark and I have a dataframe like the following:

+-------------------+----------+--------------------+----------------+--------------+
|           placekey|naics_code|       visits_by_day|date_range_start|date_range_end|
+-------------------+----------+--------------------+----------------+--------------+
|22b-223@627-wdh-fcq|    311811|[22,16,22,32,44,1...|      2018-12-31|    2019-01-07|
|22b-222@627-wc3-99f|    311811|     [2,4,3,3,4,6,5]|      2019-01-28|    2019-02-04|
|222-222@627-w9g-rrk|    311811|     [3,3,5,5,6,2,5]|      2019-02-04|    2019-02-11|
+-------------------+----------+--------------------+----------------+--------------+

I want to create another column, date_bet_dates, that holds the list of dates between date_range_start and date_range_end. This is the code I have so far:

def get_dates(s, e):
    start = datetime.strptime(s, '%Y-%m-%d').date()
    end = datetime.strptime(e, '%Y-%m-%d').date()
    return pd.date_range(start, end - timedelta(days=1),freq='d')

udf_get_dates  = udf(lambda x: get_dates(x), DateType())

df = df.withColumn('date_bet_dates', udf_get_dates(df['date_range_start'], df['date_range_end']))

df.show(3)

An error occurs at the line df.show(3):

TypeError: <lambda>() takes 1 positional argument but 2 were given

I have no idea what arguments it is talking about, but I assume this is something to do with my get_dates function. What needs to be changed to solve this problem?

pltc · Accepted Answer · 2021-11-30 00:01:48Z

2

There are two problems with your UDF:

Your lambda takes only one parameter, lambda x: get_dates(x), while the function it calls takes two, def get_dates(s, e).
You expect the UDF to return a list of dates (return pd.date_range(...)), but its declared return type is just DateType, not an ArrayType.
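The first point can be reproduced without Spark. The sketch below is plain Python with a stand-in get_dates and illustrative date strings: a UDF applied to two columns receives two values per row, so a one-parameter lambda fails and a two-parameter lambda works.

```python
def get_dates(s, e):
    # Stand-in for the real function; only the number of parameters matters here.
    return (s, e)

one_arg = lambda x: get_dates(x)         # what the question used
two_args = lambda x, y: get_dates(x, y)  # matches the two input columns

try:
    # Two values are passed, just as a UDF applied to two columns would pass them.
    one_arg("2019-01-28", "2019-02-04")
    error_message = None
except TypeError as exc:
    error_message = str(exc)

result = two_args("2019-01-28", "2019-02-04")
print(error_message)
print(result)
```

The message captured in error_message is the same kind of TypeError that surfaces at df.show(3). It appears there rather than at withColumn because Spark evaluates the UDF lazily.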

Here is the fix:

from datetime import datetime, timedelta

import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql import types as T

def get_dates(s, e):
    start = datetime.strptime(s, '%Y-%m-%d').date()
    end = datetime.strptime(e, '%Y-%m-%d').date()
    # Return a plain Python list of dates, excluding the end date
    return pd.date_range(start, end - timedelta(days=1), freq='D').date.tolist()

# The lambda takes two arguments, and the return type is an array of dates
udf_get_dates = F.udf(lambda x, y: get_dates(x, y), T.ArrayType(T.DateType()))

# Apply the UDF to both columns
df = df.withColumn('date_bet_dates', udf_get_dates(df['date_range_start'], df['date_range_end']))
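The question does not say whether date_range_start and date_range_end are stored as strings or as DateType, and strptime only accepts strings. The following is a hedged sketch of a variant that accepts either and avoids the pandas dependency. The names dates_between and make_dates_udf are hypothetical, introduced only for this illustration.

```python
from datetime import date, datetime, timedelta

def dates_between(start, end):
    """List the dates from start (inclusive) up to end (exclusive)."""
    if start is None or end is None:
        return None  # Spark passes SQL NULL to a Python UDF as None
    # Accept either 'YYYY-MM-DD' strings or datetime.date objects.
    if isinstance(start, str):
        start = datetime.strptime(start, "%Y-%m-%d").date()
    if isinstance(end, str):
        end = datetime.strptime(end, "%Y-%m-%d").date()
    n_days = (end - start).days
    # range() is empty when n_days <= 0, so the result is then an empty list.
    return [start + timedelta(days=i) for i in range(n_days)]

def make_dates_udf():
    # Imported lazily so dates_between can be tested without Spark installed.
    from pyspark.sql import functions as F
    from pyspark.sql import types as T
    # No lambda is needed: the function already takes two arguments.
    return F.udf(dates_between, T.ArrayType(T.DateType()))
```

With a Spark session available, this would be used as df.withColumn('date_bet_dates', make_dates_udf()(df['date_range_start'], df['date_range_end'])).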


2 Comments

pltc Over a year ago

my bad, updated. It's always a good practice to import functions as F and types as T

9ganzi Over a year ago

I realized what it is right after I asked. Sorry it's my bad, I used DateType myself, I should have figured out lol

rchome · Accepted Answer · 2021-11-29 23:39:08Z

1

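get_dates() takes two arguments, s and e, but you've wrapped it in a lambda that takes only one argument, x. Get rid of the lambda and pass get_dates directly; that resolves the TypeError. (The return type is a separate issue, covered in the other answer.)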

udf_get_dates  = udf(get_dates, DateType())


Daeho Ro · Accepted Answer · 2021-11-30 01:11:00Z

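If I were doing this, I would use the built-in sequence function instead of a UDF. Note that sequence includes date_range_end itself, as the output shows.

from pyspark.sql import functions as f
df.withColumn('date_bet_dates', f.expr('sequence(to_date(date_range_start), to_date(date_range_end), interval 1 days)')).show(truncate=False)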

+----------------+--------------+------------------------------------------------------------------------------------------------+
|date_range_start|date_range_end|date_bet_dates                                                                                  |
+----------------+--------------+------------------------------------------------------------------------------------------------+
|2018-12-31      |2019-01-07    |[2018-12-31, 2019-01-01, 2019-01-02, 2019-01-03, 2019-01-04, 2019-01-05, 2019-01-06, 2019-01-07]|
|2019-01-28      |2019-02-04    |[2019-01-28, 2019-01-29, 2019-01-30, 2019-01-31, 2019-02-01, 2019-02-02, 2019-02-03, 2019-02-04]|
|2019-02-04      |2019-02-11    |[2019-02-04, 2019-02-05, 2019-02-06, 2019-02-07, 2019-02-08, 2019-02-09, 2019-02-10, 2019-02-11]|
+----------------+--------------+------------------------------------------------------------------------------------------------+
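The output above includes the end date, while the get_dates function in the question stops one day earlier. A minimal sketch of an end-exclusive variant follows, assuming Spark SQL's date_sub function is used to shift the upper bound back by one day. The helper names exclusive_date_sequence and add_date_list are hypothetical.

```python
def exclusive_date_sequence(start_col, end_col):
    """Build a Spark SQL expression listing dates in [start_col, end_col)."""
    return (
        f"sequence(to_date({start_col}), "
        f"date_sub(to_date({end_col}), 1), interval 1 day)"
    )

def add_date_list(df, start_col="date_range_start", end_col="date_range_end"):
    # Imported lazily so the expression builder can be checked without Spark.
    from pyspark.sql import functions as F
    expression = exclusive_date_sequence(start_col, end_col)
    return df.withColumn("date_bet_dates", F.expr(expression))
```

Rows where the start date is not strictly before the end date may need separate handling, since the shifted upper bound then falls before the start.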
