NetCDF

Why NetCDF?

Suppose your research has produced a time series of data that you want to make accessible to other people. Imagine your time series is CO2 concentration in the atmosphere, which could be plotted as:
keeling.png

One obvious way to distribute the data would be as an ASCII file, with time in the first column and concentration in the second column. But what format should the time be written in? Should the ASCII file be in Unix or DOS format? See Unix and DOS issues or the Wikipedia entry Newline. Should the time code be JUN1995 or 061995, or 199506 or something else? Do you expect people to be able to plot the data immediately, or do you expect the user to first calculate a number, truly linear in time, from the date stamp that you give? What if the instrument was "down" during some months? What symbol should you use to denote "missing data"? Are you going to put "comments" in the file describing what the data is? Or will there be a separate "readme" file that describes the data?
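
To make the date-stamp question concrete, here is a minimal Python sketch that turns a stamp such as JUN1995 into a number that is linear in time. The helper name months_since and the reference year 1960 are assumptions chosen for illustration; they are not taken from any particular dataset.

```python
MONTHS = ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"]

def months_since(stamp, ref_year=1960):
    """Convert a stamp like 'JUN1995' to whole months since January of ref_year.

    Hypothetical helper: the stamp layout and the reference year are assumptions.
    """
    month = MONTHS.index(stamp[:3].upper())  # 0 for JAN, 11 for DEC
    year = int(stamp[3:])
    return (year - ref_year) * 12 + month

print(months_since("JUN1995"))  # 35 full years plus 5 months = 425
```

Every reader of an ASCII file would have to write something like this, and would have to guess the stamp layout first.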

Suppose, rather than 50 years of monthly averages of CO2 concentration at a point, you need to share many years of global analysis of winds. What would be the purpose of sending ASCII data if no human could be expected to scan more than a tiny portion of the data in a text editor or text viewer? In deference to the computing resources of the potential users of your data, the data should be stored as compactly as possible, using the minimum number of bits to represent the precision of the number. Should the data be stored in the standard binary format that can be read into a C or Fortran program? Would the standard 32 bit floating point number be a waste of space, since the data has only 3 significant digits? Should the binary data be big endian or little endian format?
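
The questions about byte order and wasted bits can be illustrated with Python's standard struct module. This is only a sketch: the sample value and the scale and offset are hypothetical, and packing into a 16-bit integer is one possible way to save space, not a statement about how any particular file stores its data.

```python
import struct

value = 3.51  # hypothetical measurement with 3 significant digits

# The same 32-bit float written in big-endian and little-endian byte order
big = struct.pack(">f", value)
little = struct.pack("<f", value)
print(big == little[::-1])  # same four bytes, opposite order

# Pack into a 16-bit integer using a scale and offset (both assumed here)
scale, offset = 0.01, 0.0
packed = struct.pack(">h", round((value - offset) / scale))
unpacked = struct.unpack(">h", packed)[0] * scale + offset

print(len(big), len(packed))          # bytes used: 4 versus 2
print(abs(unpacked - value) < scale)  # precision kept to within one scale step
```

A reader who receives only the raw bytes cannot know the byte order, the scale, or the offset unless that information travels with the data.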

Enter netCDF to solve these dilemmas.

Why not HDF?

In future years this page will be about HDF5; for now it concerns NetCDF3. NetCDF4 will be just a special case of HDF5, as you can read in the Wikipedia entry for NetCDF. There is a lot of data out there in NetCDF3 format. Presumably software will be developed to efficiently translate NetCDF3 files to NetCDF4. When that happens, we can all head for The HDF Group, and Python users will be reaching for pytables to read and write both HDF and netCDF files. For 2010, we use the netCDF3 module, which emulates the syntax of an older package, Scientific.IO.NetCDF, as described at the NetCDF Interface to Python. However, on gentry we import with from netCDF3 import Dataset as NetCDFFile. As of 21 June 2010, this does NOT work on gentry: from mpl_toolkits.basemap import NetCDFFile

What follows is my own attempt at a NetCDF tutorial, using Python. For the tutorial and task for 2010, you will need nctask4.zip.

In your Gentry account: unzip nctask4.zip

A simple NetCDF file, and a simple plot

In your nctask4 directory, you will find a small file keeling.cdf. (If you are curious about where it came from, visit Climate Data Library. Go to "Keeling" and then to "Mauna_Loa", "Data Downloads", "Files". Grab the NetCDF file. It is a rather ancient dataset, with data only through 1995.)

Apparently, netCDF files can use either .nc or .cdf as an extension. This is confusing because there is another data format called CDF that is incompatible with netCDF. Gee, I wonder what extension CDF uses to designate its files?
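
Since the extension cannot be trusted, one option is to look at the first bytes of the file instead. The sketch below relies on two facts: a classic netCDF file begins with the characters CDF followed by a version byte of 1 or 2, and an HDF5 file normally begins with an 8-byte signature. The function name sniff_format and the demo file are hypothetical, and the check ignores every other format.

```python
import os
import tempfile

def sniff_format(path):
    """Guess a file's format from its first bytes (hypothetical helper)."""
    with open(path, "rb") as f:
        head = f.read(8)
    if head[:3] == b"CDF" and head[3:4] in (b"\x01", b"\x02"):
        return "netCDF classic"
    if head == b"\x89HDF\r\n\x1a\n":
        return "HDF5 (possibly netCDF-4)"
    return "unknown"

# Demonstration on a throwaway file holding only a classic netCDF magic number
with tempfile.TemporaryDirectory() as tmp:
    demo = os.path.join(tmp, "demo.cdf")
    with open(demo, "wb") as f:
        f.write(b"CDF\x01" + b"\x00" * 4)
    print(sniff_format(demo))  # netCDF classic
```

On a real file you would call sniff_format("keeling.cdf") from the directory that holds it.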

If your computer has netCDF installed, as does Gentry, your computer will have a few tame pieces of software that allow C and Fortran programs to read and write netCDF files, as well as some short utility programs ncdump and ncgen.

To see a summary of what is in a netcdf file:

ncdump -h keeling.cdf

keeling.cdf is such a small file (and not really a worthy candidate for all the fuss of storing it as a NetCDF file) that you can also dump all the data to your screen:

ncdump keeling.cdf

(cat keeling.cdf is probably NOT a good way to see the data.) To learn more about ncdump, type man ncdump.

Actually, ncdump is translating the binary netcdf file to an ascii file written in "common data language". It is possible to make modest edits in short NetCDF files by ncdump any.nc > mydump, then edit mydump, then ncgen -o mynew.nc mydump.

Now let's read keeling.cdf into a Python program, and plot out the data. Just type: simple.py You should see the above Keeling curve.

You should be familiar with the use of matplotlib for plotting. We use netCDF3 for reading and writing netCDF3 files. You may be able to learn most of what you need to know about netCDF from the examples at netCDF3, and from a Python interpreter, e.g.:

>>> import netCDF3
>>> help(netCDF3)

which displays help(netCDF3).

Once again, this may be useful: NetCDF Interface to Python

Multidimensional data in a NetCDF file

The files palmerna.cdf and climap.nc contain some multidimensional data. The data is rather interesting, independent of its use as examples in information technology.

Get a summary of the data:

ncdump -h palmerna.cdf
ncdump -h climap.nc

In your nctask4 directory, you will find a general-purpose plotting program for NetCDF data that can be mapped. Try this:

ncplotter.py climap.nc

enter X variable=>X
shape of X:  (180,)

enter Y variable=>Y
shape of Y:  (91,)

enter variable to plot=>LGM_aug_sst
shape of LGM_aug_sst:  (91, 180)
 using specified missing value = [-99999.,]

LGM_aug_sst has maximum  29.2000007629   and minimum: -1.10000002384
enter lower bound on color scale =>0
enter upper bound on color scale =>30.
using clim= [0.0, 30.0]

Or prestore your responses in a file, and do:

ncplotter.py climap.nc < myresponses

Or do everything from the command line:

ncplotter.py -X X -Y Y -v LGM_aug_sst -l 0. -u 30. climap.nc

You should see this:
LGM_aug_sst.png
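
The session above reports the maximum and minimum only after setting aside the missing value. Here is a minimal sketch of why that matters. The sample values are hypothetical (they are not read from climap.nc), and plain Python lists stand in for whatever ncplotter.py does internally.

```python
MISSING = -99999.0  # the sentinel reported in the session above

# Hypothetical sample values, not taken from climap.nc
row = [28.5, MISSING, -1.1, 12.0, MISSING]

# Keep only the points that hold real data
valid = [v for v in row if v != MISSING]

print(min(row))                 # -99999.0: the sentinel wins if left in
print(min(valid), max(valid))   # -1.1 28.5
```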


Here is another plot:

ncplotter.py -p lam -r palmerna.cdf
enter X variable=>X
enter Y variable=>Y
enter variable to plot=>PDSI
enter record number to plot=>440
enter lower bound on color scale =>-10
enter upper bound on color scale =>10

You should see the Palmer Drought Severity Index in August 1956, a time of extreme drought in Oklahoma:
aug1956pdsi.png

You don't need to know much about drought to complete the task, but you may want to visit:

How did I know rec=440 was the time of the minimum PDSI in Oklahoma? I ran hov.py ok, which contains several methods of data exploration within palmerna.cdf. Two figures are produced. The first is a Hovmoeller-like plot of PDSI (meaning a plot in the X-T plane) at the latitude of Oklahoma:

hov.png

The second is a time series of the PDSI in Oklahoma and New York, and a calculation of the correlation between the two time series, which continues the investigation of the size of droughts:
ts.png
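
For reference, the correlation between two time series can be sketched in a few lines of plain Python. This is the standard Pearson correlation on made-up numbers; hov.py may well compute it differently, and the function name correlation is an assumption.

```python
import math

def correlation(x, y):
    """Pearson correlation of two equal-length sequences (illustrative sketch)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Made-up series: b rises with a, c falls as a rises
a = [1.0, 2.0, 3.0, 4.0]
b = [2.0, 4.0, 6.0, 8.0]
c = [4.0, 3.0, 2.0, 1.0]
print(correlation(a, b))  # close to +1
print(correlation(a, c))  # close to -1
```

A value near +1 means the two locations tend to be wet or dry together; a value near 0 means they vary independently.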

The investigation of the size of droughts continues with corrmap.py, which produces corrmap.nc. The correlation between the PDSI in Oklahoma and the PDSI in other areas of North America is plotted with:

corrmap.py
ncplotter.py -p lam corrmap.nc
enter X variable=>X
enter Y variable=>Y
enter variable to plot=>corr
enter lower bound on color scale =>-1
enter upper bound on color scale =>1

or this:

ncplotter.py -p lam -X X -Y Y -v corr \
  -t 'PDSI correlation with OK' -l -1 -u 1 -o c.png  corrmap.nc

okcorr.png

Task for 2010

Your task is to extend hov.py and corrmap.py to study autocorrelation. For every location, we can find the correlation of the time series with the same time series, but displaced (lagged) by a certain number of months. Here is the plot of the autocorrelation at NY, as a function of time lag:
nyautocorr.png
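
To pin down the definition, here is a minimal sketch of lagged autocorrelation on a synthetic series, in plain Python. It is not the intended change to hov.py: the function name, the synthetic 12-step cycle, and the choice to correlate only the overlapping parts of the series are all assumptions made for illustration.

```python
import math

def lagged_autocorrelation(series, lag):
    """Correlate series[t] with series[t + lag] over the overlapping part."""
    x = series[:len(series) - lag]  # the earlier values
    y = series[lag:]                # the same series, shifted by lag steps
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Synthetic series that repeats every 12 steps (not PDSI data)
series = [math.sin(2 * math.pi * t / 12) for t in range(120)]
for lag in (0, 6, 12):
    print(lag, round(lagged_autocorrelation(series, lag), 3))
```

For this synthetic cycle the autocorrelation is near +1 at lags 0 and 12 and near -1 at lag 6. Real PDSI data will decay with lag rather than oscillate so cleanly.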

The above plot can be made with fewer than 10 lines of new code within hov.py. Your subtask 1 is to make such changes to hov.py, but plotting out the autocorrelation at OK. (Of course, you may first want to check that your code can reproduce the above plot for NY.)

Your subtask 2 involves making changes to corrmap.py, to make maps such as this: autocorr12.png

You need to make the maps for the 1 month lag and the 36 month lag.

Both of these subtasks have some tips in the comments within hov.py and corrmap.py.