
I have a dataset that spans over several years.

# Import full dataset 
df_all = pd.read_csv('https://raw.githubusercontent.com/dssgPT/Plotting-Good-DSSG/main/desafios/006_Incendios/area_ardida.csv')
# Assign years variable to unique values in years column
years = df_all['year'].unique()
# Print years variable
print(years)

array([1999, 2000, 1997, 1992, 1995, 1998, 1980, 1996, 1982, 1987, 1989,
       1991, 1981, 1983, 1984, 1986, 1988, 1990, 1993, 1994, 1985, 2005,
       2006, 2007, 2010, 2011, 2012, 2001, 2002, 2003, 2004, 2008, 2009,
       2013, 2014, 2017, 2019, 2021, 2015, 2018, 2020, 2016])

My objective is to create a different dataframe for each year and then do some operations on each. I could do it manually, as in the example below:

df_2017 = df_all[df_all['year']==2017]
df_totals_2017 = df_2017.groupby(["concelho", "year"]).sum()
df_totals_2017.to_csv('/content/drive/MyDrive/2022/DATAVIZ/dssg_2017.csv')

but I'm wondering if there is an optimal way of doing this. Any help would be much appreciated.

  • Is the required outcome simply that a CSV is generated for each year? Does it necessarily need to be stuffed into a variable named with the specific year value? Commented Jul 6, 2022 at 12:55
  • For the specific case yes Commented Jul 6, 2022 at 12:58
  • Yes... to which of my two questions...? Commented Jul 6, 2022 at 12:58
  • Is your dataframe saved on some sort of cloud service or similar, or is it saved locally? I am asking because using pandas for tasks like this isn't exactly recommended. Commented Jul 6, 2022 at 12:59
  • Whenever you see yourself needing to dynamically create variable names, that should set off a red flag. There are plenty of ways to operate on individual years within a single dataframe. If the requirement is yearly outputs, perhaps you can save the splitting until the output stage. Commented Jul 6, 2022 at 13:01
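The suggestion in the last comment — keep everything in one dataframe and split only at the output stage — can be sketched with groupby. This is a hedged illustration: the toy data below merely mimics the question's columns, and the output path is illustrative.

```python
import pandas as pd

# Toy frame with the same columns as the question's dataset (values invented)
df_all = pd.DataFrame({
    "concelho": ["Lisboa", "Lisboa", "Porto", "Porto"],
    "year": [2016, 2017, 2016, 2017],
    "sum": [10.0, 20.0, 30.0, 40.0],
})

# groupby('year') yields (year, sub-frame) pairs, so no per-year
# variables are needed until the output (to_csv) stage
for year, df_year in df_all.groupby("year"):
    totals = df_year.groupby("concelho").sum(numeric_only=True)
    totals.to_csv(f"dssg_{year}.csv")  # illustrative output path
```

Each iteration writes one CSV of per-concelho totals, without ever naming a variable after the year.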

2 Answers


This is one way to do it; I'm not sure it is the most efficient, but it will do the job:

for i in years:
    # Filter on the year column (df_all.eq(i) compares every column and yields empty results)
    grouped_df = df_all[df_all.year.eq(i)].groupby(["concelho"]).sum()
    # Grouping by year as well would be redundant here
    grouped_df.to_csv(f'/content/drive/MyDrive/2022/DATAVIZ/dssg_{i}.csv')

4 Comments

If you are looking for speed, another approach would be to convert the dataframe to NumPy momentarily and then reindex the df.
@INGIOR1AMOR1 I actually tried it, but with df_all.eq(i) it returns empty CSV files for each year.
@INGIOR1AMOR1 This does work: grouped_df = df_all[df_all.year.eq(i)].groupby(["concelho"]).sum()
oh, hahaha okay then

Taking your code and restructuring it slightly, you can loop over all the years and construct a dictionary of split dataframes:

many_dfs = {y: df_all[df_all['year']==y] for y in df_all['year'].unique()}

Exploring this dictionary further:

>>> many_dfs.keys()
dict_keys([1999, 2000, 1997, 1992, 1995, 1998, 1980, 1996, 1982, 1987, 1989, 1991, 1981, 1983, 1984, 1986, 1988, 1990, 1993, 1994, 1985, 2005, 2006, 2007, 2010, 2011, 2012, 2001, 2002, 2003, 2004, 2008, 2009, 2013, 2014, 2017, 2019, 2021, 2015, 2018, 2020, 2016])

>>> type(many_dfs[2017])
<class 'pandas.core.frame.DataFrame'>

>>> many_dfs[2017].columns
Index(['Distrito', 'concelho', 'month', 'year', 'sum', 'longitude',
       'latitude'],
      dtype='object')
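From here, the per-year operation in the question can be applied to each entry of the dictionary. A sketch, with toy data standing in for df_all and an illustrative output filename:

```python
import pandas as pd

# Toy stand-in for df_all with a subset of the columns shown above (values invented)
df_all = pd.DataFrame({
    "concelho": ["Lisboa", "Porto", "Lisboa"],
    "year": [2017, 2017, 2016],
    "sum": [5.0, 7.0, 3.0],
})

# Dictionary of split dataframes, keyed by year (as in the answer above)
many_dfs = {y: df_all[df_all['year'] == y] for y in df_all['year'].unique()}

# Apply the question's groupby/to_csv step to each split frame
for y, df_y in many_dfs.items():
    totals = df_y.groupby(["concelho", "year"]).sum(numeric_only=True)
    totals.to_csv(f"dssg_totals_{y}.csv")  # illustrative output path
```

This keeps the one-liner dictionary split and defers the per-year work to a single loop.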

