Climate Data

Author: Graziano Giuliani <ggiulian@ictp.it>

What is Climate Science

The study of climate was born as a science in the last century, from the growing evidence of the human origin of the rapid increase in global average temperatures registered over the last 60 years. Given also that global climate change is an ongoing phenomenon and that human activities can enhance or mitigate its effects in the near future, a consistent effort is made by governments and institutions to assess the possible impacts and the required policies.

This decision-making process must be given the best available information on the expected change: scientists have a role in the process as brokers of information.

Datasets

The information needed can be divided into:

  1. Past data to reconstruct the history of the climate of the whole Planet over the last 12000 years. These data are in the form of

    • Measurements from proxies such as tree rings, pollen, polar ice sheets drillings, etc.
    • Historical documents from the different human cultures
    • Reconstruction of landscapes from paintings, images
    • Hand-recorded measurements taken from instruments in the last 400 years
  2. Records of near-past weather information collected for short-term forecasts over the last 50 years. These data are in the form of

    • Digitized measurements from instruments both at the surface and on vertical profiles in the atmosphere or in the depth for oceans
    • Satellite measurements for the last 30 years
    • Economic, population and emission datasets
    • Geological and Ecological data to best describe the Ecosystem
  3. Future scenario data from models, to evaluate the expected change for mitigation purposes or the outcome of possible policies on the change.

The large amount of data, especially for the last two points above, requires a big effort to establish standards for data exchange between different organizations and different communities.

http://www.realclimate.org/index.php/data-sources

The Magic Words here are COMMUNITY CONTROLLED STANDARDS.

Let’s go with an example:

  • In country A a government organization has a daily temperature time series of measurements from an instrument to share. Its national convention, however, is to register the values in degrees Fahrenheit, and the data are digitally recorded on magnetic media in a binary format where the date is expressed as the three values DDMMYY for day, month, year in the local calendar.
  • The organization in country B receives the binary data files, along with a text document. Using the data then requires the intervention of a skilled computer programmer to read them, change the unit of measure and fix the date, in order to merge the measurements into a larger dataset, as in the sketch below.
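The kind of ad-hoc glue code that the organization in country B ends up writing might look like the following minimal Python sketch; the record layout, field names and values are purely hypothetical, standing for whatever the accompanying text document happens to describe.

    from datetime import datetime

    def convert_record(raw_date, temp_fahrenheit):
        # Parse the packed DDMMYY date string (assuming the Gregorian calendar).
        date = datetime.strptime(raw_date, "%d%m%y").date()
        # Convert degrees Fahrenheit to degrees Celsius.
        temp_celsius = (temp_fahrenheit - 32.0) * 5.0 / 9.0
        return date.isoformat(), temp_celsius

    # Example: 1 July 1987, 77 F  ->  ('1987-07-01', 25.0)
    print(convert_record("010787", 77.0))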

Repeat this for each measurement, instrument, satellite or model dataset, multiply the number of involved organizations by a thousand, and possibly throw in the commercial lock-ins of for-profit organizations.

IF a STANDARD for data storage and transmission can be respected, then the global knowledge coming from the data can be used much more easily, empowering researchers worldwide and freeing their time for the final scientific goal.

The Format and the Conventions

In the Climate and Forecast community netCDF is the data format, and the CF conventions define metadata that provide a definitive description of what the data in each variable represent, and of the spatial and temporal properties of the data.

This enables users of data from different sources to decide which quantities are comparable, and facilitates building applications with powerful extraction, regridding, and display capabilities.

The document describing the standard is maintained and enhanced through free contributions from users, via a mailing list and a Trac interface at

http://cf-pcmdi.llnl.gov

The Program for Climate Model Diagnosis and Intercomparison (PCMDI) has been hosted at the Lawrence Livermore National Laboratory in Livermore, California (US) since 1989, and was given the task of finding an easy way to compare the results of different climate models.

netCDF data format

NetCDF is a set of software libraries and self-describing, machine-independent data formats that support the creation, access, and sharing of array-oriented scientific data. NetCDF was developed and is maintained at Unidata, part of the University Corporation for Atmospheric Research (UCAR) Community Programs (UCP).

http://www.unidata.ucar.edu/software/netcdf

Unidata is funded primarily by the National Science Foundation.

At a high level, the file is structured like a file system, with a root Group that can hierarchically contain other groups. Each group can contain three different entities:

  • Dimensions, which define the dimensions of any variable in all sub-groups
  • Variables, which contain data and can be of pre-defined types or of any user-defined type.
  • Attributes, which define metadata relevant at any group or variable level.
[Figure: the netCDF-4 data model]

The library provides C, C++ and Java language interfaces to access the data, on top of which all other programming language interfaces are implemented.

An easily extendable layer of filtering can be applied to the actual data before writing them to a file system, and the library gives the programmer a clear set of functions in a well engineered API to store and retrieve data and metadata.
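As a minimal sketch of this data model in practice, the following Python fragment uses the netCDF4 library (one of the interfaces built on top of the C library) to create a small file with dimensions, variables and attributes; the file name and the data values are placeholders.

    import numpy as np
    from netCDF4 import Dataset

    # Create a new netCDF-4 file: the Dataset object is the root group.
    ds = Dataset("example.nc", "w", format="NETCDF4")

    # Dimensions: an unlimited time axis and two fixed spatial axes.
    ds.createDimension("time", None)
    ds.createDimension("lat", 4)
    ds.createDimension("lon", 8)

    # Variables: coordinate variables plus a data variable on those dimensions.
    lat = ds.createVariable("lat", "f4", ("lat",))
    lon = ds.createVariable("lon", "f4", ("lon",))
    tas = ds.createVariable("tas", "f4", ("time", "lat", "lon"))

    # Attributes: metadata attached to the whole file or to a single variable.
    ds.title = "Toy example dataset"
    tas.units = "K"

    # Write some data and close the file.
    lat[:] = np.linspace(-45.0, 45.0, 4)
    lon[:] = np.linspace(0.0, 315.0, 8)
    tas[0, :, :] = np.full((4, 8), 288.0, dtype="f4")
    ds.close()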

CF Conventions

On top of the capabilities offered by the netCDF format, a Convention among different data production entities has been established to ease the sharing of data.

The convention mostly defines the type of information to be provided in the metadata, the names and recognized units of measure of climate-relevant variables, and the way to encode relations and ancillary information, allowing the user to write applications that operate on the data.

In particular, basic information is encoded to provide the following (a minimal usage sketch follows the list):

  • Identification of the Origin of the data, answering the questions what, where, when and how about the data in the file.
  • A Standard Name identifying which geophysical variable is in the file, regardless of the application-specific name given to it.
  • A reference Unit of Measure for each variable, along with a set of allowed conversion interfaces
  • Geolocation attributes which allow standard views or projections to be easily identified and allow regridding capabilities to be built on top.
  • Time calendars and units to allow the user to fix the point in time where the information is valid
  • Spatial and temporal bounds to define the spatio-temporal grid on which the information is valid
  • Packing and reduction of the original data
  • Climatological statistics which were applied to the original data
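The following Python sketch, again based on the netCDF4 library, shows how some of this information appears as CF attributes; the variable names, attribute values and file name are illustrative only.

    import numpy as np
    from netCDF4 import Dataset

    ds = Dataset("cf_example.nc", "w", format="NETCDF4")
    ds.createDimension("time", None)
    ds.createDimension("lat", 4)
    ds.createDimension("lon", 8)

    # Time coordinate: units and calendar locate each value in time.
    time = ds.createVariable("time", "f8", ("time",))
    time.units = "days since 1949-12-01 00:00:00"
    time.calendar = "standard"

    # Geophysical variable: standard_name identifies the quantity,
    # units gives the reference unit of measure.
    tas = ds.createVariable("tas", "i2", ("time", "lat", "lon"))
    tas.standard_name = "air_temperature"
    tas.units = "K"
    # Packing: readers unpack the stored short integers as
    # unpacked = scale_factor * packed + add_offset.
    tas.scale_factor = np.float32(0.01)
    tas.add_offset = np.float32(273.15)

    # Global attributes answering the what, where, when and how questions.
    ds.Conventions = "CF-1.7"
    ds.title = "Illustrative CF-compliant file"
    ds.institution = "Hypothetical institute"
    ds.source = "Toy example, not real data"

    ds.close()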

The advantages coming from the enhanced capabilities given by conformity to the standard at the application level must obviously be greater than the effort required to implement conformity to the standard at the data generation level.

Distributing Climate Data

On top of the netCDF data format, the netCDF library API allows an application to access a remote dataset using URLs instead of local file paths.

This is permitted by the OPeNDAP protocol, an acronym for “Open-source Project for a Network Data Access Protocol”.

OPeNDAP is a data transport architecture and protocol based on HTTP which includes standards for encapsulating structured data, annotating the data with attributes and adding semantics that describe the data.

The protocol is maintained by OPeNDAP.org, a publicly funded non-profit organization that also provides free reference implementations of OPeNDAP servers and clients.

The netCDF library, acting as a client through the cURL library, sends requests to an OPeNDAP server and receives various types of documents or binary data as a response.

One such document is the DDS (received when a DDS request is sent), which describes the structure of a data set. It is parsed by the library to present through the API the same structure offered by a netCDF on-disk file.
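Assuming a DAP2-style server, the DDS can also be inspected directly by appending the .dds suffix to the dataset URL; a minimal Python sketch with a placeholder address:

    import urllib.request

    # Hypothetical OPeNDAP endpoint: appending ".dds" asks the server
    # for the DDS document describing the data set structure.
    url = "http://example.org/thredds/dodsC/sample/tas_monthly.nc.dds"
    with urllib.request.urlopen(url) as response:
        print(response.read().decode())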

A data set, seen from the server side, may be a file, a collection of files or a database. Another document type that may be received is the DAS, which gives attribute values for the fields described in the DDS. Binary data is received when the client sends a DODS request, and the netCDF library allows the programmer to write exactly the same code to access local and remote data.
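Assuming a netCDF library built with OPeNDAP support, accessing a remote dataset is then just a matter of replacing the local path with a URL; the server address and variable name below are placeholders, and only the requested slice of the variable travels over the network.

    from netCDF4 import Dataset

    # Hypothetical OPeNDAP endpoint; the same code works with a local file path.
    url = "http://example.org/thredds/dodsC/sample/tas_monthly.nc"
    ds = Dataset(url)

    tas = ds.variables["tas"]
    print(tas.dimensions, tas.shape)

    # Slicing triggers a DODS request for just this subset of the data.
    first_field = tas[0, :, :]
    print(first_field.mean())

    ds.close()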

An OPeNDAP server can serve an arbitrarily large collection of data. Data on the server is often natively in HDF or NetCDF format, but can be in any format, including a user-defined one. Compared to ordinary file transfer protocols (e.g. FTP), a major advantage of using OPeNDAP is the ability to retrieve subsets of files, and also the ability to aggregate data from several files in one transfer operation.

A number of OPeNDAP servers are available; among them we will examine RAMADDA

http://ramadda.org/repository

and the THREDDS Data Server

http://www.unidata.ucar.edu/software/thredds/current/tds/TDS.html

They are not the OPeNDAP reference data server (which is called Hyrax), but they provide a better user experience in presenting the data available on the server.