5

My usual method for extracting the min/max of a variable's data values from a NetCDF file is a magnitude of order slower when switching to the netCDF4 Python module compared to scipy.io.netcdf.

I am working with relatively large ocean model output files (from ROMS) with multiple depth levels over a given map region (Hawaii). When these were in NetCDF-3, I used scipy.io.netcdf.

Now that these files are in NetCDF-4 ("Classic") I can no longer use scipy.io.netcdf and have instead switched over to using the netCDF4 Python module. However, the slowness is a concern and I wondered if there is a more efficient method of extracting a variable's data range (minimum and maximum data values)?

Here was my NetCDF-3 method using scipy:

import scipy.io.netcdf
netcdf = scipy.io.netcdf.netcdf_file(file)
var = netcdf.variables['sea_water_potential_temperature']
min = var.data.min()
max = var.data.max()

Here is my NetCDF-4 method using netCDF4:

import netCDF4
netcdf = netCDF4.Dataset(file)
var = netcdf.variables['sea_water_potential_temperature']
var_array = var.data.flatten()
min = var_array.data.min()
max = var_array.data.max()

The notable difference is that I must first flatten the data array in netCDF4, and this operation apparently slows things down.

Is there a better/faster way?

7
  • 2
    Not sure what the real reason is, but the code base for scipy.io.netcdf and netcdf-python is quite different. First one is based on pupynere, which had some nice optimizations if I remember correctly. Also, the specifications of netCDF changed considerably from ver. 3 -> 4. I guess you could try something like np.array(var.data).max() to avoid the flattening of the netCDF Variable. It's hard to say because the structure of the netCDF file is unknown. Commented Feb 13, 2014 at 0:18
  • 2
    Why do you have to flatten the array? Does import numpy as np; np.max(var[:]) work? Commented Feb 13, 2014 at 13:13
  • Thanks. I guess flattening is not the bottleneck. Accessing the array via var[:] is. @abudis: In order to work, I had to modify your command to np.array(var[:].data).max(). @SpencerHill: yes, that works but is equally slow. These two suggestions each take the same amount of time as my example above: still slow. I suppose the scipy optimizations plus netCDF4 changes that @abudis mentions may be the culprit. Commented Feb 13, 2014 at 22:36
  • 1
    @JohnMaurer FYI var[:] is already a numpy.ndarray, so there's no need to call the np.array() function on it or access its data attribute. Just var[:].max() does the trick. That doesn't help with the computation speed though. A faster method doesn't immediately come to mind, but other more expert users likely know one. Commented Feb 14, 2014 at 2:02
  • 1
    How about using one of the nco operators, called as a subprocess? Something like: ncwa -y max ... Commented Aug 8, 2014 at 5:19

3 Answers 3

2

Per suggestion of hpaulj here is a function that calls the nco command ncwa using subprocess. It hangs terribly when using an OPeNDAP address, and I don't have any files on hand to test it locally.

You can see if it works for you and what the speed difference is.

This assumes you have the nco library installed.

def ncwa(path, fnames, var, op_type, times=None, lons=None, lats=None):
    '''Perform arithmetic operations on netCDF file or OPeNDAP data

    Args
    ----
    path: str
        prefix
    fnames: str or iterable
        Names of file(s) to perform operation on
    op_type: str
        ncwa arithmetic operation to perform. Available operations are:
        avg,mabs,mebs,mibs,min,max,ttl,sqravg,avgsqr,sqrt,rms,rmssdn
    times: tuple
        Minimum and maximum timestamps within which to perform the operation
    lons: tuple
        Minimum and maximum longitudes within which to perform the operation
    lats: tuple
        Minimum and maximum latitudes within which to perform the operation

    Returns
    -------
    result: float
        Result of the operation on the selected data

    Note
    ----
    Adapted from the OPeNDAP examples in the NCO documentation:
    http://nco.sourceforge.net/nco.html#OPeNDAP
    '''
    import os
    import netCDF4
    import numpy
    import subprocess

    output = 'tmp_output.nc'

    # Concatenate subprocess command
    cmd = ['ncwa']
    cmd.extend(['-y', '{}'.format(op_type)])
    if times:
        cmd.extend(['-d', 'time,{},{}'.format(times[0], times[1])])
    if lons:
        cmd.extend(['-d', 'lon,{},{}'.format(lons[0], lons[1])])
    if lats:
        cmd.extend(['-d', 'lat,{},{}'.format(lats[0], lats[1])])
    cmd.extend(['-p', path])
    cmd.extend(numpy.atleast_1d(fnames).tolist())
    cmd.append(output)

    # Run cmd and check for errors
    subprocess.run(cmd, stdout=subprocess.PIPE, check=True)

    # Load, read, close data and delete temp .nc file
    data = netCDF4.Dataset(output)
    result = float(data[var][:])
    data.close()
    os.remove(output)

    return result

path = 'https://ecowatch.ncddc.noaa.gov/thredds/dodsC/hycom/hycom_reg6_agg/'
fname = 'HYCOM_Region_6_Aggregation_best.ncd'

times = (0.0, 48.0)
lons = (201.5, 205.5)
lats = (18.5, 22.5)

smax = ncwa(path, fname, 'salinity', 'max', times, lons, lats)
Sign up to request clarification or add additional context in comments.

Comments

2

If you're just getting the min/max values across an array of a variable, you can use xarray.

%matplotlib inline
import xarray as xr

da = xr.open_dataset('infile/file.nc')
max = da.sea_water_potential_temperature.max()
min = da.sea_water_potential_temperature.min()

This should give you a single value of min/max, respectively. You could also get the min/max of a variable across a selected dimension like time, longitude, latitude etc. Xarray is great for handling multidimensional arrays that is why it's pretty easy to handle NetCDF in python when you're not using other operating tools like CDO and NCO. Lastly, xarray is also used in other related libraries that deals with weather and climate data in python ( http://xarray.pydata.org/en/stable/related-projects.html ).

Comments

1

A Python solution (using CDO as a backend) is my package nctoolkit (https://pypi.org/project/nctoolkit/ https://nctoolkit.readthedocs.io/en/latest/installing.html).

This has a number of built in methods for calculating different types of min/max values.

We would first need to read the file in as a dataset:

import nctoolkit as nc

data = nc.open_data(file)

If you wanted the maximum value across space, for each timestep, you would do the following:

data.spatial_max()

Maximum across all depths for each grid cell and time step would be calculated as follows:

data.vertical_max()

If you wanted the maximum across time, you would do:

data.max()

These methods are chainable, and the CDO backend is very efficient, so should be ideal for working with ROMS data.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.