
I have a dataset that spans over several years.

# Import full dataset 
df_all = pd.read_csv('https://raw.githubusercontent.com/dssgPT/Plotting-Good-DSSG/main/desafios/006_Incendios/area_ardida.csv')
# Assign years variable to unique values in years column
years = df_all['year'].unique()
# Print years variable
print(years)

array([1999, 2000, 1997, 1992, 1995, 1998, 1980, 1996, 1982, 1987, 1989,
       1991, 1981, 1983, 1984, 1986, 1988, 1990, 1993, 1994, 1985, 2005,
       2006, 2007, 2010, 2011, 2012, 2001, 2002, 2003, 2004, 2008, 2009,
       2013, 2014, 2017, 2019, 2021, 2015, 2018, 2020, 2016])

My objective is to create a different dataframe for each year and then do some operations on each. I could do it manually, as in the example below:

df_2017 = df_all[df_all['year']==2017]
df_totals_2017 = df_2017.groupby(["concelho", "year"]).sum()
df_totals_2017.to_csv('/content/drive/MyDrive/2022/DATAVIZ/dssg_2017.csv')

but I'm wondering if there is an optimal way of doing this. Any help would be much appreciated.

  • Is the required outcome simply that a CSV is generated for each year? Does it necessarily need to be stuffed into a variable named with the specific year value? Commented Jul 6, 2022 at 12:55
  • For the specific case yes Commented Jul 6, 2022 at 12:58
  • Yes... to which of my two questions...? Commented Jul 6, 2022 at 12:58
  • Is your dataframe saved on some sort of cloud service or similar, or is it saved locally? I am asking because using pandas for tasks like this isn't exactly recommended. Commented Jul 6, 2022 at 12:59
  • Whenever you see yourself needing to dynamically create variable names, that should set off a red flag. There are plenty of ways to operate on individual years within a single dataframe. If the requirement is yearly outputs, perhaps you can save the splitting until the output stage. Commented Jul 6, 2022 at 13:01
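The suggestion in the last comment — keep everything in one dataframe and split only at the output stage — can be sketched with groupby. This is a hedged illustration: the toy data below merely mimics the question's columns, and the output path is illustrative.

```python
import pandas as pd

# Toy frame with the same columns as the question's dataset (values invented)
df_all = pd.DataFrame({
    "concelho": ["Lisboa", "Lisboa", "Porto", "Porto"],
    "year": [2016, 2017, 2016, 2017],
    "sum": [10.0, 20.0, 30.0, 40.0],
})

# groupby('year') yields (year, sub-frame) pairs, so no per-year
# variables are needed until the output (to_csv) stage
for year, df_year in df_all.groupby("year"):
    totals = df_year.groupby("concelho").sum(numeric_only=True)
    totals.to_csv(f"dssg_{year}.csv")  # illustrative output path
```

Each iteration writes one CSV of per-concelho totals, without ever naming a variable after the year.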

2 Answers


This is one way to do it; I'm not sure it is the most efficient, but it will do the job:

for i in years:
    # Filter on the year column (df_all.eq(i) compares every column and yields empty results)
    grouped_df = df_all[df_all.year.eq(i)].groupby(["concelho"]).sum()
    # Grouping by year as well would be redundant here
    grouped_df.to_csv(f'/content/drive/MyDrive/2022/DATAVIZ/dssg_{i}.csv')

4 Comments

If you are looking for speed, another approach would be to convert the dataframe to NumPy momentarily and then reindex the df.
@INGIOR1AMOR1 I actually tried it, but with df_all.eq(i) it returns empty CSV files for each year.
@INGIOR1AMOR1 This does work: grouped_df = df_all[df_all.year.eq(i)].groupby(["concelho"]).sum()
oh, hahaha okay then

Taking your code and restructuring it slightly, you can loop over all the years and construct a dictionary of split dataframes:

many_dfs = {y: df_all[df_all['year']==y] for y in df_all['year'].unique()}

Exploring this dictionary further:

>>> many_dfs.keys()
dict_keys([1999, 2000, 1997, 1992, 1995, 1998, 1980, 1996, 1982, 1987, 1989, 1991, 1981, 1983, 1984, 1986, 1988, 1990, 1993, 1994, 1985, 2005, 2006, 2007, 2010, 2011, 2012, 2001, 2002, 2003, 2004, 2008, 2009, 2013, 2014, 2017, 2019, 2021, 2015, 2018, 2020, 2016])

>>> type(many_dfs[2017])
<class 'pandas.core.frame.DataFrame'>

>>> many_dfs[2017].columns
Index(['Distrito', 'concelho', 'month', 'year', 'sum', 'longitude',
       'latitude'],
      dtype='object')
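From here, the per-year operation in the question can be applied to each entry of the dictionary. A sketch, with toy data standing in for df_all and an illustrative output filename:

```python
import pandas as pd

# Toy stand-in for df_all with a subset of the columns shown above (values invented)
df_all = pd.DataFrame({
    "concelho": ["Lisboa", "Porto", "Lisboa"],
    "year": [2017, 2017, 2016],
    "sum": [5.0, 7.0, 3.0],
})

# Dictionary of split dataframes, keyed by year (as in the answer above)
many_dfs = {y: df_all[df_all['year'] == y] for y in df_all['year'].unique()}

# Apply the question's groupby/to_csv step to each split frame
for y, df_y in many_dfs.items():
    totals = df_y.groupby(["concelho", "year"]).sum(numeric_only=True)
    totals.to_csv(f"dssg_totals_{y}.csv")  # illustrative output path
```

This keeps the one-liner dictionary split and defers the per-year work to a single loop.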

