Python Reading Multiple NetCDF Rainfall files of variable size

Question

The issue I have is that the Australian Bureau of Meteorology has supplied me with Rainfall Data Files, that contains rainfall records recorded every 30 minutes for all active gauges. The problem is that for 1 day there are 48 30Minute files. I want to create time series of a particular Gauge. Which means reading all 48 files and searching for the Gauge ID, making sure it doesn't fail if for 1 30 minute period the gauge did not record anything?? here is link to file format:

https://dl.dropboxusercontent.com/u/15223371/14/gauge_30min_20100214_000000.nc

https://dl.dropboxusercontent.com/u/15223371/14/gauge_30min_20100214_003000.nc

https://dl.dropboxusercontent.com/u/15223371/14/gauge_30min_20100214_010000.nc

This is what I have tried so far:

"""
This script is used to read a directory of raingauge data from a Data Directory





"""
from anuga.file.netcdf import NetCDFFile
from anuga.config import netcdf_mode_r, netcdf_mode_w, netcdf_mode_a, \
                            netcdf_float
import os
import glob
from easygui import *
import string
import numpy
"""
print 'Default file Extension...'
msg="Enter 3 letter extension."
title = "Enter the 3 letter file extension to search for in DIR "
default = "csv"
file_extension = enterbox(msg,title,default)
"""


print 'Present Directory Open...'
title = "Select Directory to Read Multiple rainfall .nc files"
msg = "This is a test of the diropenbox.\n\nPick the directory that you wish to open."
d = diropenbox(msg, title)
fromdir = d

filtered_list = glob.glob(os.path.join(fromdir, '*.nc'))
filtered_list.sort()

nf = len(filtered_list)
print nf

import numpy

rain = numpy.zeros(nf,'float')
t = numpy.arange(nf)

Stn_Loc_File='Station_Location.csv'
outfid = open(Stn_Loc_File, 'w')

prec = numpy.zeros((nf,1752),numpy.float)

gauge_id_list = ['570002','570021','570025','570028','570030','570032','570031','570035','570036',
                 '570047','570772','570781','570910','570903','570916','570931','570943','570965',
                 '570968','570983','570986','70214','70217','70349','70351']
"""
title = "Select Gauge to plot"
msg = "Select Gauge"
gauge_id = int(choicebox(msg=msg,title=title, choices=gauge_id_list))
"""
#for gauge_id in gauge_id_list:
#    gauge_id = int(gauge_id)
try:    

    for i, infile in enumerate(filtered_list):

        infilenet = NetCDFFile(infile, netcdf_mode_r)
        print infilenet.variables
        raw_input('Hold.... check variables...')
        stn_lats = infilenet.variables['latitude']
        stn_longs = infilenet.variables['longitude']
        stn_ids = infilenet.variables['station_id']
        stn_rain = infilenet.variables['precipitation']

        print stn_ids.shape
        #print stn_lats.shape
        #print stn_longs.shape
        #print infile.dimensions
        stn_ids = numpy.array(stn_ids)

        l_id = numpy.where(stn_ids == gauge_id)
        if stn_ids in gauge_id_list:
            try:
                l_id = l_id[0][0]
                rain[i] = stn_rain[l_id]
            except:
                rain[i] = numpy.nan
    print 'End for i...'            
    #print rain

    import pylab as pl

    pl.bar(t,rain)
    pl.title('Rain Gauge data')
    pl.xlabel('time steps')
    pl.ylabel('rainfall (mm)')
    pl.show()
except:
    pass 
raw_input('END....')

This snippet is too long. Could you a) decribe the part you're having trouble with and b) cut down your code sample to the bit that illustrates it? Also, why are you using anuga.file.netcdf -- any particular reason for this choice? — chryss
– chryss, Commented Sep 22, 2013 at 6:43
Sorry... Yes this is essentially part of the Open Source Model ANUGA. But it uses NetCDF. So this is where it lives on my machine. But Essentially I am having trouble opening and reading the NetCDF file, and grouping data from variable sized data files. The station_id is what I want to collect from the 48 daily files? { PS I am a complete novice... well out of my depth ... but trying to give it a go, to get involved in Open Source....} — Rudy Van Drie
– Rudy Van Drie, Commented Sep 22, 2013 at 11:43
I answered below, but for the future, please consider cutting down your example code (by about 80%), showing ONLY the part you're having trouble with. Also, some output would be good. — chryss
– chryss, Commented Sep 22, 2013 at 19:27
You wanted to comment on my answer, but couldn't as it takes 50 privilege point to comment. Your edit was rejected (not by me -- the community vote). Basically, you said I was using the wrong variable names. ... — chryss
– chryss, Commented Sep 23, 2013 at 15:11
The code I posted runs on the files you provided. For the variable names, I get: In [117]: data.variables Out[117]: {... 'precip': <scipy.io.netcdf.netcdf_variable at 0x108692090>, 'stid': <scipy.io.netcdf.netcdf_variable at 0x1086921d0>, } If in your files or with your API the names are different, use these. Also, I suggest you familiarise yourself with the way the community works! stackoverflow.com/about — chryss
– chryss, Commented Sep 23, 2013 at 15:12

chryss · Accepted Answer · 2013-09-23 15:47:27Z

OK, you got the data in a format that's more convoluted than it would need to be. They could easily have stuffed the whole day into a netCDF file. And indeed, one option for you to solve this would have been to combine all files into one with a times dimension, using for example the NCO command line tools.

But here is a solution that uses the scipy netcdf module. I believe it is deprecated -- myself, I prefer the NetCDF4 library. The main approach is: preset your output data structure with np.nan values; loop through your input files and retrieve precipitation and station ids; for each of your stationids of interest, retrieve index, and then precipitation at that index; add to the output structure. (I didn't do the work to extract timestamps - that's up to you.)

import glob
import numpy as np
from scipy.io import netcdf

# load data file names 
stationdata = glob.glob('gauge*.nc')
stationdata.sort()
# initialize np arrays of integer gauging station ids
gauge_id_list = ['570002','570021','570025','570028','570030','570032','570031','570035','570036',
                 '570047','570772','570781','570910','570903','570916','570931','570943','570965',
                 '570968','570983','570986','70214','70217','70349','70351']
gauge_ids = np.array(gauge_id_list).astype('int32')
ngauges = len(gauge_ids)
ntimesteps = 48
# initialize output dictionary
dtypes = zip(gauge_id_list, ['float32']*ngauges)
timeseries_per_station = np.empty((ntimesteps,))
timeseries_per_station.fill(np.nan)
timeseries_per_station = timeseries_per_station.astype(dtypes)

# Instead of using the index, you could extract the datetime stamp 
for timestep, datafile in enumerate(stationdata):
    data = netcdf.NetCDFFile(datafile, 'r')
    precip = data.variables['precip'].data
    stid = data.variables['stid'].data
    # create np array of indices of the gaugeid present in file
    idx = np.where(np.in1d(stid, gauge_ids))[0]
    for i in idx:
        timeseries_per_station[str(stid[i])][timestep] = precip[i]
    data.close()

np.set_printoptions(precision=1)
for gauge_id in gauge_id_list:
    print "Station %s:" % gauge_id
    print timeseries_per_station[gauge_id]

The output looks like this:

Station 570002:
[ 1.9  0.3  0.   nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan
  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan
  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan
  nan  nan  nan]
Station 570021:
[  0.   0.   0.  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan
  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan
  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan
  nan  nan  nan]
...

(Obviously, there were only three files.)

Edit: The OP noted that the code wasn't running without errors for him because his variable names are "precipitation" and "station_id". The code runs for me on the files he posted. Obviously, he should be using whatever variable names are used in the files that he was supplied with. As they seem to be custom-produced files for his use, it is conceivable that the authors may not be consistent in variable naming.

Thank you ... yes you are right. Seems like the Bureau of Meteorology have not developed consistent standards for extracting their data?

Collectives™ on Stack Overflow

Python Reading Multiple NetCDF Rainfall files of variable size

1 Answer 1

1 Comment

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

1 Comment

Your Answer

Sign up or log in

Post as a guest

Linked

Related