I have the following Xarray named 'scratch', with the lat, long, and lev coords eliminated and only the time coord left as a dimension. It has several variables, and it is now a multivariate daily time series from 2002 to 2014. I need to add a new variable, "water_year", that gives the water year each day belongs to. This could be done by adding another variable with Xarray.assign, or perhaps with Xarray.resample, but I am not sure and could use some help. Note: a "Water Year" starts on Oct 01 and ends on Sep 30 of the next year, so water year 2003 runs from 10-01-2002 to 09-30-2003.

See my Xarray here

  • Like this? Commented May 17, 2022 at 4:51

1 Answer

I'll create a sample dataset with a single variable for this example:

In [1]: import numpy as np
   ...: import pandas as pd
   ...: import xarray as xr

In [2]: scratch = xr.Dataset(
   ...:     {'Baseflow': (('time', ), np.random.random(4018))},
   ...:     coords={'time': pd.date_range('2002-10-01', freq='D', periods=4018)},
   ...: )

In [3]: scratch
Out[3]:
<xarray.Dataset>
Dimensions:   (time: 4018)
Coordinates:
  * time      (time) datetime64[ns] 2002-10-01 2002-10-02 ... 2013-09-30
Data variables:
    Baseflow  (time) float64 0.7588 0.05129 0.9914 ... 0.7744 0.6581 0.8686

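We can build a water_year array using the datetime components accessor .dt. The comparison month >= 10 is a boolean that counts as 1 when added to the year, so October through December are assigned to the following calendar year: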

In [4]: water_year = (scratch.time.dt.month >= 10) + scratch.time.dt.year
   ...: water_year
Out[4]:
<xarray.DataArray (time: 4018)>
array([2003, 2003, 2003, ..., 2013, 2013, 2013])
Coordinates:
  * time     (time) datetime64[ns] 2002-10-01 2002-10-02 ... 2013-09-30
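
The rule is easy to check on individual dates without xarray. Here is a minimal sketch using only the standard library; water_year_of is a hypothetical helper name, not part of xarray or pandas:

```python
from datetime import date


def water_year_of(day: date) -> int:
    """Return the water year (Oct 1 through Sep 30) that contains `day`."""
    # A bool is an int in Python, so True adds 1 for October through December.
    return day.year + (day.month >= 10)


# Boundary dates around the start and end of water year 2003.
for day in (date(2002, 9, 30), date(2002, 10, 1), date(2003, 9, 30), date(2003, 10, 1)):
    print(day.isoformat(), water_year_of(day))
```

The two dates on either side of each Oct 1 boundary should land in different water years.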

Because water_year is a DataArray indexed by an existing dimension, we can just add it as a coordinate and xarray will understand that it's a non-dimension coordinate. This is important to make sure we don't create a new dimension in our data.

In [7]: scratch.coords['water_year'] = water_year

In [8]: scratch
Out[8]:
<xarray.Dataset>
Dimensions:     (time: 4018)
Coordinates:
  * time        (time) datetime64[ns] 2002-10-01 2002-10-02 ... 2013-09-30
    water_year  (time) int64 2003 2003 2003 2003 2003 ... 2013 2013 2013 2013
Data variables:
    Baseflow    (time) float64 0.7588 0.05129 0.9914 ... 0.7744 0.6581 0.8686
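
Assigning through .coords modifies scratch in place. If you would rather leave the original untouched, assign_coords returns a new Dataset, and assign stores water_year as a data variable instead, which is closer to what the question describes. A sketch that rebuilds the sample data so it runs on its own (with_coord and with_var are illustrative names only):

```python
import numpy as np
import pandas as pd
import xarray as xr

# Same sample data as above: 11 water years of daily values.
time = pd.date_range("2002-10-01", freq="D", periods=4018)
scratch = xr.Dataset(
    {"Baseflow": (("time",), np.random.random(time.size))},
    coords={"time": time},
)
water_year = scratch.time.dt.year + (scratch.time.dt.month >= 10)

# Non-mutating: returns a new Dataset with water_year as a non-dimension coordinate.
with_coord = scratch.assign_coords(water_year=water_year)

# Alternative: keep water_year as an ordinary data variable.
with_var = scratch.assign(water_year=water_year)

print(with_coord)
print(with_var)
```

Keeping water_year as a coordinate is usually the more convenient choice for groupby, since it travels with every variable you select from the Dataset.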

Because water_year is indexed by time, we still need to select from the arrays using the time dimension, but we can subset the arrays to specific water years:

In [9]: scratch.sel(time=(scratch.water_year == 2010))
Out[9]:
<xarray.Dataset>
Dimensions:     (time: 365)
Coordinates:
  * time        (time) datetime64[ns] 2009-10-01 2009-10-02 ... 2010-09-30
    water_year  (time) int64 2010 2010 2010 2010 2010 ... 2010 2010 2010 2010
Data variables:
    Baseflow    (time) float64 0.441 0.7586 0.01377 ... 0.2656 0.1054 0.6964

groupby accepts a non-dimension coordinate by name, so aggregating by water year works directly:

In [10]: scratch.groupby('water_year').sum()
Out[10]:
<xarray.Dataset>
Dimensions:     (water_year: 11)
Coordinates:
  * water_year  (water_year) int64 2003 2004 2005 2006 ... 2010 2011 2012 2013
Data variables:
    Baseflow    (water_year) float64 187.6 186.4 184.7 ... 185.2 189.6 192.7
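
A quick sanity check on the grouping is to count the days in each water year: every group should have 365 days, or 366 when it contains a Feb 29. A sketch that rebuilds the sample data so it runs on its own (days_per_year is an illustrative name):

```python
import numpy as np
import pandas as pd
import xarray as xr

time = pd.date_range("2002-10-01", freq="D", periods=4018)
scratch = xr.Dataset(
    {"Baseflow": (("time",), np.random.random(time.size))},
    coords={"time": time},
)
scratch.coords["water_year"] = scratch.time.dt.year + (scratch.time.dt.month >= 10)

# Count the non-missing daily values in each water year.
days_per_year = scratch.groupby("water_year").count()["Baseflow"]
print(days_per_year.to_series())
```

A count far from 365 would point to a partial first or last water year, which is worth knowing before comparing annual sums.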

1 Comment

Excellent Michael, that is exactly what I was looking for. Thank you so much for the effort explaining this.
