I have a dataframe with over 41500 records and 3 fields: ID, start_date and end_date.

I want to create a separate dataframe from it with just 2 fields, ID and active_years, containing one record per ID for every year between the start year and the end year (the end year is included in the range).

This is what I'm doing right now, but for 41500 rows it takes more than 2 hours to finish.

df = pd.DataFrame(columns=['id', 'active_years'])
ix = 0

for _, row in raw_dataset.iterrows():

    st_yr = int(row['start_date'].split('-')[0]) # because dates are in the format yyyy-mm-dd
    end_yr = int(row['end_date'].split('-')[0])

    for year in range(st_yr, end_yr+1):

        df.loc[ix, 'id'] = row['ID']
        df.loc[ix, 'active_years'] = year
        ix = ix + 1

So is there any faster way to achieve this?

[EDIT] Here is a small example to work with:

raw_dataset = pd.DataFrame({'ID':['a121','b142','cd3'],'start_date':['2019-10-09','2017-02-06','2012-12-05'],'end_date':['2020-01-30','2019-08-23','2016-06-18']})

print(raw_dataset)
     ID  start_date    end_date
0  a121  2019-10-09  2020-01-30
1  b142  2017-02-06  2019-08-23
2   cd3  2012-12-05  2016-06-18

# the desired dataframe should look like this
print(desired_df)
     id  active_years
0  a121  2019
1  a121  2020
2  b142  2017
3  b142  2018
4  b142  2019
5   cd3  2012
6   cd3  2013
7   cd3  2014
8   cd3  2015
9   cd3  2016
  • Can you share a minimal, runnable example showing your input and desired output? Just something simple: a small sample input and the corresponding sample output. Commented Oct 4, 2019 at 7:32
  • @szerszen I added some examples to help you get an idea Commented Oct 4, 2019 at 7:43
  • Write a function to extract the years from the start_date and end_date columns and supply that function to a call to the .apply() method. Commented Oct 4, 2019 at 7:55
  • Your method likely creates a dataframe with even more entries than your original one (although the strings are shorter...), so I'd consider applying pd.to_datetime() to your 'start_date' / 'end_date' columns. You could then retrieve the years as raw_dataset['start_date'][idx].year. Commented Oct 4, 2019 at 8:06

1 Answer

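Dynamically growing Python lists is much faster than dynamically growing NumPy arrays (which are the underlying data structure of pandas dataframes). Appending to a list takes amortized constant time, whereas enlarging an array generally means allocating a new one and copying the existing data, so adding rows one at a time with df.loc gets slower as the dataframe grows. With that in mind: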

import pandas as pd

# Initialize input dataframe
raw_dataset = pd.DataFrame({
    'ID':['a121','b142','cd3'],
    'start_date':['2019-10-09','2017-02-06','2012-12-05'],
    'end_date':['2020-01-30','2019-08-23','2016-06-18'],
})

# Create integer columns for start year and end year
raw_dataset['start_year'] = pd.to_datetime(raw_dataset['start_date']).dt.year
raw_dataset['end_year'] = pd.to_datetime(raw_dataset['end_date']).dt.year

# Iterate over input dataframe rows and individual years
id_list = []
active_years_list = []
for row in raw_dataset.itertuples():
    for year in range(row.start_year, row.end_year+1):
        id_list.append(row.ID)
        active_years_list.append(year)

# Create result dataframe from lists
desired_df = pd.DataFrame({
    'id': id_list,
    'active_years': active_years_list,
})

print(desired_df)
# Output:
#     id  active_years
# 0  a121          2019
# 1  a121          2020
# 2  b142          2017
# 3  b142          2018
# 4  b142          2019
# 5   cd3          2012
# 6   cd3          2013
# 7   cd3          2014
# 8   cd3          2015
# 9   cd3          2016
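
If the Python-level loop is still too slow, the same result can probably be built without iterating over rows at all. The following is only a sketch, not part of the approach above: it assumes every end year is greater than or equal to its start year, and the names counts, offsets and vectorized_df are made up for this illustration.

```python
import numpy as np
import pandas as pd

raw_dataset = pd.DataFrame({
    'ID': ['a121', 'b142', 'cd3'],
    'start_date': ['2019-10-09', '2017-02-06', '2012-12-05'],
    'end_date': ['2020-01-30', '2019-08-23', '2016-06-18'],
})

# Integer start/end years as NumPy arrays
start = pd.to_datetime(raw_dataset['start_date']).dt.year.to_numpy()
end = pd.to_datetime(raw_dataset['end_date']).dt.year.to_numpy()

# Number of active years per input row (assumes end >= start everywhere)
counts = end - start + 1

# Repeat each ID once per active year
ids = np.repeat(raw_dataset['ID'].to_numpy(), counts)

# Position of each output row within its own ID group: 0, 1, ..., count-1
group_starts = np.cumsum(counts) - counts
offsets = np.arange(counts.sum()) - np.repeat(group_starts, counts)

# Start year of the group plus the position within the group
years = np.repeat(start, counts) + offsets

vectorized_df = pd.DataFrame({'id': ids, 'active_years': years})
print(vectorized_df)
```

For the three sample rows this should give the same ten rows as desired_df above. I have not timed it against the list-based version on the full 41500 rows.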

3 Comments

that is what I also had in mind, but the question is: is the datetime conversion really faster than a simple string split and conversion to integer?
@MrFuppes I didn't check which method of obtaining the integer start/end year values is the fastest. I'm reasonably certain, however, that this step is not the overall performance bottleneck. Based on my experience, the bottleneck (when working with dataframes with a large number of rows) is going to be in the later steps, where the result dataframe is created.
right! regarding the question, your note on the difference between Python lists and np arrays is actually the point here I'd say.
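
For anyone who wants to settle the string-split versus datetime question on their own machine, a rough timing sketch could look like the one below. The helper names years_by_slicing and years_by_datetime and the 30000-row test series are made up for this illustration, and it assumes all dates are well-formed yyyy-mm-dd strings. No timings are quoted here because they depend on the pandas version and hardware.

```python
import timeit

import pandas as pd

# Hypothetical test data: the three sample dates repeated 10,000 times
dates = pd.Series(['2019-10-09', '2017-02-06', '2012-12-05'] * 10_000)


def years_by_slicing(s):
    """Take the first four characters of each string and cast to int."""
    return s.str[:4].astype(int)


def years_by_datetime(s):
    """Parse the full date, then read the year component."""
    return pd.to_datetime(s).dt.year


# Both approaches should agree before their timings are compared
assert years_by_slicing(dates).tolist() == years_by_datetime(dates).tolist()

t_slice = timeit.timeit(lambda: years_by_slicing(dates), number=5)
t_datetime = timeit.timeit(lambda: years_by_datetime(dates), number=5)

print(f"string slicing: {t_slice:.4f} s for 5 runs")
print(f"to_datetime:    {t_datetime:.4f} s for 5 runs")
```

Either way, this step runs once per column rather than once per output row, so it is unlikely to matter much next to the cost of building the result dataframe.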
