Faster way of creating a pandas DataFrame from another DataFrame

Question

I have a dataframe with over 41,500 records and 3 fields: ID, start_date and end_date.

I want to create a separate dataframe from it with just 2 fields, ID and active_years, containing one record per ID for every year between the start year and the end year (end year inclusive).

This is what I'm doing right now, but for 41,500 rows it takes more than 2 hours to finish.

df = pd.DataFrame(columns=['id', 'active_years'])
ix = 0

for _, row in raw_dataset.iterrows():

    st_yr = int(row['start_date'].split('-')[0]) # because dates are in the format yyyy-mm-dd
    end_yr = int(row['end_date'].split('-')[0])

    for year in range(st_yr, end_yr+1):

        df.loc[ix, 'id'] = row['ID']
        df.loc[ix, 'active_years'] = year
        ix = ix + 1

So is there any faster way to achieve this?

[EDIT] Some example data to experiment with:

raw_dataset = pd.DataFrame({'ID':['a121','b142','cd3'],'start_date':['2019-10-09','2017-02-06','2012-12-05'],'end_date':['2020-01-30','2019-08-23','2016-06-18']})

print(raw_dataset)
     ID  start_date    end_date
0  a121  2019-10-09  2020-01-30
1  b142  2017-02-06  2019-08-23
2   cd3  2012-12-05  2016-06-18

# the desired dataframe should look like this
print(desired_df)
     id  active_years
0  a121          2019
1  a121          2020
2  b142          2017
3  b142          2018
4  b142          2019
5   cd3          2012
6   cd3          2013
7   cd3          2014
8   cd3          2015
9   cd3          2016

Can you share a minimal, runnable example showing your input and desired output? Something as simple as a small sample input and a small sample output would do.
– Dariusz Krynicki, Commented Oct 4, 2019 at 7:32
Write a function to extract the years from the start_date and end_date columns and supply that function to a call to the .apply() method.
– jamesoh, Commented Oct 4, 2019 at 7:55
Your method likely creates a dataframe with even more entries than your original one (although the strings are shorter...), so I'd consider applying pd.to_datetime() to your 'start_date' and 'end_date' columns. You could then retrieve the years as raw_dataset['start_date'][idx].year.
– FObersteiner, Commented Oct 4, 2019 at 8:06

Xukrao · Accepted Answer · 2019-10-04 08:42:32Z


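Dynamically growing Python lists is much faster than dynamically growing NumPy arrays (which are the underlying data structure of pandas dataframes). Appending to a list takes amortized constant time, whereas enlarging an array means allocating a new one and copying the existing data over. With that in mind: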

import pandas as pd

# Initialize input dataframe
raw_dataset = pd.DataFrame({
    'ID':['a121','b142','cd3'],
    'start_date':['2019-10-09','2017-02-06','2012-12-05'],
    'end_date':['2020-01-30','2019-08-23','2016-06-18'],
})

# Create integer columns for start year and end year
raw_dataset['start_year'] = pd.to_datetime(raw_dataset['start_date']).dt.year
raw_dataset['end_year'] = pd.to_datetime(raw_dataset['end_date']).dt.year

# Iterate over input dataframe rows and individual years
id_list = []
active_years_list = []
for row in raw_dataset.itertuples():
    for year in range(row.start_year, row.end_year+1):
        id_list.append(row.ID)
        active_years_list.append(year)

# Create result dataframe from lists
desired_df = pd.DataFrame({
    'id': id_list,
    'active_years': active_years_list,
})

print(desired_df)
# Output:
#     id  active_years
# 0  a121          2019
# 1  a121          2020
# 2  b142          2017
# 3  b142          2018
# 4  b142          2019
# 5   cd3          2012
# 6   cd3          2013
# 7   cd3          2014
# 8   cd3          2015
# 9   cd3          2016
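
As a possible alternative to the explicit loop (this is not part of the original answer), the same result can be built without any Python-level iteration by repeating each ID with np.repeat and computing the year offsets arithmetically. This is a minimal sketch that assumes every end year is greater than or equal to its start year; the names counts, offsets and vectorized_df are illustrative only.

```python
import numpy as np
import pandas as pd

raw_dataset = pd.DataFrame({
    'ID': ['a121', 'b142', 'cd3'],
    'start_date': ['2019-10-09', '2017-02-06', '2012-12-05'],
    'end_date': ['2020-01-30', '2019-08-23', '2016-06-18'],
})

# Integer start and end years as NumPy arrays
start_year = pd.to_datetime(raw_dataset['start_date']).dt.year.to_numpy()
end_year = pd.to_datetime(raw_dataset['end_date']).dt.year.to_numpy()

# Number of output rows per input row (end year inclusive).
# Assumption: end_year >= start_year for every row.
counts = end_year - start_year + 1

# Position of each output row within its own group: 0, 1, ..., count - 1
group_starts = np.cumsum(counts) - counts
offsets = np.arange(counts.sum()) - np.repeat(group_starts, counts)

vectorized_df = pd.DataFrame({
    'id': np.repeat(raw_dataset['ID'].to_numpy(), counts),
    'active_years': np.repeat(start_year, counts) + offsets,
})

print(vectorized_df)
```

For the three sample rows this yields the same ten (id, active_years) pairs as the loop above. Whether it is noticeably faster on the full dataset has not been measured here.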

edited Oct 4, 2019 at 8:42

answered Oct 4, 2019 at 8:07

Xukrao


3 Comments

FObersteiner Over a year ago

That is what I also had in mind, but the question is: is the datetime conversion really faster than a simple string split and conversion to integer?

Xukrao Over a year ago

@MrFuppes I didn't check which method of obtaining the integer start/end year values is the fastest. I'm reasonably certain, however, that this step is not the overall performance bottleneck. In my experience, the bottleneck (when working with dataframes that have a large number of rows) is going to be in the later steps, where the result dataframe is created.

FObersteiner Over a year ago

Right! Regarding the question, I'd say your note on the difference between Python lists and NumPy arrays is actually the point here.
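
The open question in the comments above (string split versus datetime conversion for extracting the year) could be checked with a small timing sketch like the one below. The benchmark data is hypothetical: the three sample dates repeated to roughly the size mentioned in the question. The helper names years_via_split and years_via_datetime are made up for illustration, and no timings are claimed here since they depend on the machine and the pandas version.

```python
import timeit

import pandas as pd

# Hypothetical benchmark data: the three sample dates repeated to 41,502 rows
dates = pd.Series(['2019-10-09', '2017-02-06', '2012-12-05'] * 13834)


def years_via_split(s):
    # Take the text before the first '-' and convert it to an integer
    return s.str.split('-').str[0].astype(int)


def years_via_datetime(s):
    # Parse the full date, then read the year component
    return pd.to_datetime(s).dt.year


# Both approaches should agree before their speed is compared
assert years_via_split(dates).tolist() == years_via_datetime(dates).tolist()

for fn in (years_via_split, years_via_datetime):
    seconds = timeit.timeit(lambda: fn(dates), number=5)
    print(f'{fn.__name__}: {seconds / 5:.4f} s per call')
```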
