
Handling NetCDF files using XArray for absolute beginners

NetCDF is a machine-independent, array-oriented, multi-dimensional, self-describing, and portable data format used by various scientific communities. Its filename extension is .nc or .cdf (there are reportedly subtle differences between the two). Unlike .csv or .xlsx files, NetCDF files cannot be opened and viewed directly in spreadsheet software.

Even if you could, you would not want to do that with 4-dimensional data carrying a bunch of metadata.

I will use climate data from the Atmospheric Radiation Measurement Climate Research Facility (ARM) in the United States and the European Centre for Medium-Range Weather Forecasts (ECMWF) in Europe as examples.

Prerequisites

We will use the xarray library in Python for data processing. Long story short, it builds upon numpy (and dask) and leverages the power of pandas, but you probably don’t need to know the details. As you might know, package dependencies are a pain in Python. That is why the most convenient way to get everything installed is the following command:

$ conda install xarray dask netCDF4 bottleneck

Experienced Python programmers are encouraged to check the relevant documentation for more details. If you are a beginner, no worries. Here is a list of the dependencies you need to check:

Python 2.7/3.5+ (required)
numpy 1.12+ (required)
pandas 0.19.2+ (required)
scipy, for interpolation features
bottleneck, for speeding up NaN-skipping
netCDF4-python, for basic netCDF operations such as reading/writing
dask-array 0.16+, for parallel computing with dask

If you want to visualize your dataset, you will probably need these:

matplotlib 1.5+, for plotting
cartopy, for maps
seaborn, for better colour palettes

If you are an absolute beginner, you can check your default version of Python with

$ python --version
Python 2.7.5

You can also check whether Python 3 is installed with

$ python3 --version
Python 3.4.9

To check the versions of installed packages, use pip freeze or conda list. Everything should check out if you installed xarray through conda.
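If you would rather check versions from inside Python, the following is a minimal sketch using only the standard library. It assumes Python 3.8 or newer (for importlib.metadata); the helper name installed_versions and the package list are illustrative choices, not part of xarray or conda.

```python
import sys
from importlib import metadata

# Distribution names as published on PyPI; adjust the list to your needs.
PACKAGES = ["xarray", "numpy", "pandas", "netCDF4", "dask", "bottleneck"]


def installed_versions(packages):
    """Return {name: version string, or None if the package is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    print("Python", ".".join(str(part) for part in sys.version_info[:3]))
    for name, version in installed_versions(PACKAGES).items():
        print(f"{name}: {version or 'not installed'}")
```

A value of None simply means the package was not found in the current environment, which is a hint to install it with conda.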

Alternatives

iris is an alternative to xarray, but some work is needed to get it running on Windows, and it does not work well on macOS. Iris is also an English word, so googling ‘iris’ gives you many irrelevant results. I found iris painful to use.

Data Preview

It is always a good idea to ‘preview’ and ‘get to know’ your data, its metadata and its data structures. Assuming you have installed netCDF4-python, the only two commands you need are ncdump and ncview (ncdump ships with the netCDF C library, while ncview is a separate program that may need to be installed on its own). The former gives a text representation of your netCDF dataset (basically the metadata and the data itself), while the latter is a very powerful graphical interface for instant data visualization.

ncdump

Go to the directory of your dataset and try

$ ncdump -h twparmbeatmC1.c1.20050101.000000.cdf

As we do not need to see the values of every data entry at the moment, -h ensures that only the header (metadata) is shown. You will get

netcdf twparmbeatmC1.c1.20050101.000000 {
dimensions:
    time = UNLIMITED ; // (8760 currently)
    range = 2 ;
    p = 37 ;
    z = 512 ;
variables:
    double base_time ;
        base_time:long_name = "Base time in Epoch" ;
        base_time:units = "seconds since 1970-1-1 0:00:00 0:00" ;
        base_time:string = "2005-01-01 00.00, GMT" ;
        base_time:ancillary_variables = "time_offset" ;
    float prec_sfc(time) ;
        prec_sfc:long_name = "Precipitation Rate" ;
        prec_sfc:standard_name = "lwe_precipitation_rate" ;
        prec_sfc:units = "mm/hour" ;
        prec_sfc:missing_value = -9999.f ;
        prec_sfc:_FillValue = -9999.f ;
        prec_sfc:source = "twpsmet60sC1.b1" ;
    float T_p(time, p) ;
        T_p:long_name = "Dry Bulb Temperature, from sounding in p coordinate" ;
        T_p:standard_name = "air_temperature" ;
        T_p:units = "K" ;
        T_p:missing_value = -9999.f ;
        T_p:_FillValue = -9999.f ;
        T_p:source = "twpsondewnpnC1.b1:tdry" ;

// global attributes:
    < OTHER METADATA >
}

You can see the dimensions, variables, and other metadata, which are quite self-explanatory. Global attributes (not printed above) tell us how the data was collected and pre-processed. In this example, the data are measurements taken by ARM at 147.4E 2.1S, Manus, Papua New Guinea.

Looking at the list of variables, 1-dim prec_sfc and 2-dim T_p, we see that they have different dimensions(!). Precipitation rate is a scalar measurement at each time, whereas temperature is a column (measured at different pressure levels instead of altitude levels this time) at each time. It is quite common to see 4-dim data in climate science: latitude, longitude, altitude/pressure level, and time.
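To make the difference in dimensionality concrete, here is a small sketch that mimics the two variables from the header with synthetic values. The numbers are made up (zeros and a constant 290 K), only 4 time steps are used instead of 8760, and the variable ds is an illustrative name.

```python
import numpy as np
import xarray as xr

n_time, n_p = 4, 37  # 37 pressure levels as in the header; 4 time steps for brevity

ds = xr.Dataset(
    {
        # one scalar per time step -> 1-dim variable
        "prec_sfc": (("time",), np.zeros(n_time, dtype="float32")),
        # one column of values per time step -> 2-dim variable
        "T_p": (("time", "p"), np.full((n_time, n_p), 290.0, dtype="float32")),
    }
)

for name, var in ds.data_vars.items():
    print(name, var.dims, var.shape)
# prec_sfc ('time',) (4,)
# T_p ('time', 'p') (4, 37)
```

Both variables share the time dimension, which is what lets them live in the same file even though their shapes differ.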

ncview

Try the following command. It opens a graphical interface that lists all variables in your dataset, and it is quite straightforward to use.

$ ncview twparmbeatmC1.c1.20050101.000000.cdf
Figure: Graphical interface in Linux using ncview

Terminology

Figure: Data structures of xarray

DataArray

xarray.DataArray is an implementation of a labelled, multi-dimensional array for a single variable, such as precipitation or temperature. It has the following key properties:

values: a numpy.ndarray holding the array’s values
dims: dimension names for each axis (e.g., ('lat', 'lon', 'z', 'time'))
coords: a dict-like container of arrays (coordinates) that label each point (e.g., 1-dim arrays of numbers, DateTime objects, or strings)
attrs: an OrderedDict to hold arbitrary metadata (attributes)

Dataset

xarray.Dataset is a collection of DataArrays. Each NetCDF file contains one Dataset.
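The two structures described above can be explored without any data file. The sketch below builds a DataArray and a Dataset from synthetic values; the temperatures, pressure levels and timestamps are invented for illustration and are not taken from the ARM file.

```python
import numpy as np
import xarray as xr

# Synthetic temperatures in K: 3 time steps x 2 pressure levels.
temperature = xr.DataArray(
    np.array([[300.0, 280.0], [301.0, 281.0], [299.0, 279.0]]),
    dims=("time", "p"),
    coords={
        "time": np.array(
            ["2005-01-01T00", "2005-01-01T01", "2005-01-01T02"],
            dtype="datetime64[ns]",
        ),
        "p": [1000.0, 850.0],
    },
    attrs={"long_name": "Dry Bulb Temperature", "units": "K"},
    name="T_p",
)

print(type(temperature.values))    # the raw numpy.ndarray
print(temperature.dims)            # dimension names
print(list(temperature.coords))    # coordinate labels
print(temperature.attrs["units"])  # metadata

# A Dataset groups several DataArrays, here sharing the time coordinate.
precipitation = xr.DataArray(
    np.zeros(3),
    dims=("time",),
    coords={"time": temperature.coords["time"]},
    name="prec_sfc",
)
ds = xr.Dataset({"T_p": temperature, "prec_sfc": precipitation})
print(list(ds.data_vars))

# Label-based selection: temperature at 850 (pressure level) for the first time step.
print(float(ds["T_p"].sel(p=850.0).isel(time=0)))
```

Selecting by label with sel, rather than by integer position, is the main convenience xarray adds on top of plain numpy arrays.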

Coding using XArray

Data Import

You cannot play with the data until you read it. Use open_dataset to read a single NetCDF file, or open_mfdataset to read multiple files, and store the result in a Dataset called DS.

import xarray as xr

# single file
dataDIR = '../data/ARM/twparmbeatmC1.c1.20050101.000000.cdf'
DS = xr.open_dataset(dataDIR)

# OR m
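If you do not have the ARM file at hand, you can still try the import step. The sketch below writes a tiny synthetic Dataset to a temporary NetCDF file and reads it back with open_dataset. It assumes a NetCDF backend such as netCDF4 or scipy is installed; the file name example.nc and the values are made up.

```python
import os
import tempfile

import numpy as np
import xarray as xr

# Synthetic stand-in for a real file; these are not ARM measurements.
synthetic = xr.Dataset(
    {"prec_sfc": (("time",), np.array([0.0, 1.5, 0.2]))},
    coords={"time": np.arange(3)},
    attrs={"title": "synthetic example, not ARM data"},
)

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "example.nc")
    synthetic.to_netcdf(path)

    # open_dataset reads lazily, so load the values before the file goes away.
    with xr.open_dataset(path) as DS:
        loaded = DS.load()

print(loaded)
```

For several files, xr.open_mfdataset accepts a glob pattern or a list of paths and combines them into one Dataset; it relies on dask being installed.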
