Data Serialization¶

  • Pickle and Shelve
  • Numpy: npy, npz
  • JSON
  • Dataframes: CSV and Feather
  • HDF5, NetCDF, and Xarray

Pickle and Shelve¶

Native Python serialization

  • Pickle.
  • Shelve.

What can be pickled and unpickled?¶

The following types can be pickled:

  • None, True, and False;
  • integers, floating-point numbers, complex numbers;
  • strings, bytes, bytearrays;
  • tuples, lists, sets, and dictionaries containing only picklable objects;

For our purposes, the list stops here. In reality, the following can also be pickled, but I do not recommend doing so yet. There are scenarios where pickling these things makes sense, but they are more advanced and we will not cover them here:

  • functions (built-in and user-defined) defined at the top level of a module (using def, not lambda);
  • classes defined at the top level of a module;
  • instances of such classes whose __dict__ or the result of calling __getstate__() is picklable.

Warning:

The pickle module is not secure! Only unpickle data you trust.

  • It is possible to construct malicious pickle data which will execute arbitrary code during unpickling. Never unpickle data that could have come from an untrusted source, or that could have been tampered with.
  • Consider signing data with hmac if you need to ensure that it has not been tampered with.
  • Safer serialization formats such as json may be more appropriate if you are processing untrusted data.
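To make the hmac suggestion concrete, here is a minimal sketch of sign-then-verify around pickle (the function names and the key are illustrative, not part of the standard library):

```python
import hashlib
import hmac
import pickle

# Hypothetical key for illustration; in practice load it from a secure location.
SECRET_KEY = b"replace-with-a-real-secret"

def dumps_signed(obj):
    """Pickle obj and prepend an HMAC-SHA256 signature (32 bytes)."""
    payload = pickle.dumps(obj)
    sig = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()
    return sig + payload

def loads_signed(data):
    """Verify the signature before unpickling; refuse tampered data."""
    sig, payload = data[:32], data[32:]
    expected = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(sig, expected):
        raise ValueError("signature mismatch: refusing to unpickle")
    return pickle.loads(payload)

signed = dumps_signed([1, 2.5, "hello"])
loads_signed(signed)  # [1, 2.5, 'hello']
```

Note that this only guards against tampering with data you signed yourself; it does not make it safe to unpickle data produced by others.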
In [1]:
import pickle
pickle.dumps(1234, protocol=0)
Out[1]:
b'I1234\n.'
In [2]:
pickle.dumps([1])
Out[2]:
b'\x80\x04\x95\x06\x00\x00\x00\x00\x00\x00\x00]\x94K\x01a.'
In [3]:
a = [1, 1.5, "hello", {3, 4}, {'int': 9, 'real': 9.0, 'complex': 9j}]

with open('data.pkl', 'wb') as f:
    pickle.dump(a, f)
    
with open('data.pkl', 'rb') as f:
    b = pickle.load(f)
    
b
Out[3]:
[1, 1.5, 'hello', {3, 4}, {'int': 9, 'real': 9.0, 'complex': 9j}]

A shelf of pickles¶

The shelve module provides a disk-stored object that behaves (mostly) like a dict, whose keys are strings and whose values are anything that can be pickled.

In [4]:
import shelve

x = 1234

with shelve.open('spam') as db:
    db['eggs'] = 'eggs'
    db['numbers'] = [1, 2, 3, 9.99, 1j]
    db['xx'] = x
In [5]:
%ls -l spam.*
-rw-r--r-- 1 jovyan jovyan   53 Apr  6 17:52 spam.bak
-rw-r--r-- 1 jovyan jovyan 1030 Apr  6 17:52 spam.dat
-rw-r--r-- 1 jovyan jovyan   53 Apr  6 17:52 spam.dir

Note:

Do not rely on the shelf being closed automatically; always call close() explicitly when you don’t need it any more, or use shelve.open() as a context manager, as shown above.

In [6]:
with shelve.open('spam') as db:
    e = db['eggs']
    n = db['numbers']

print(f'{e = }')
print(f'{n = }')
e = 'eggs'
n = [1, 2, 3, 9.99, 1j]
In [7]:
db = shelve.open('spam')
In [8]:
for var, data in db.items():
    print(f'{var} = {data}')
eggs = eggs
numbers = [1, 2, 3, 9.99, 1j]
xx = 1234
In [9]:
db.close()
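One pitfall worth knowing about: a value fetched from a shelf is a fresh copy, so in-place mutation is silently lost unless you open the shelf with writeback=True (the filename 'spam2' here is just for illustration):

```python
import shelve

with shelve.open('spam2') as db:
    db['numbers'] = [1, 2, 3]
    db['numbers'].append(4)   # mutates a temporary copy, not the shelf

with shelve.open('spam2') as db:
    print(db['numbers'])      # [1, 2, 3] -- the append was lost

# writeback=True keeps accessed entries in a memory cache and writes
# them back on sync()/close(), at the cost of memory and a slower close.
with shelve.open('spam2', writeback=True) as db:
    db['numbers'].append(4)

with shelve.open('spam2') as db:
    print(db['numbers'])      # [1, 2, 3, 4]
```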

Numpy: npy, npz¶

How to save and load NumPy objects provides more details, and the NumPy routines reference describes the full set of input/output APIs.

As a minimum starter, you should know that:

  • Numpy has a native, simple, efficient, binary storage format for single arrays as .npy files. These are portable across machines and versions of Numpy.
  • Multiple arrays can be stored in a dict-like form in a single .npz file.

The relationship between npy and npz files is somewhat similar to that between single pickles and shelve objects in Python.

In [10]:
import numpy as np

a = np.array([1, 2, 3.4])
fname = 'arr.npy'
np.save(fname, a)
b = np.load(fname)
(a == b).all()
Out[10]:
True

Multiple arrays (or scalar data) can be saved in a shelve-like object with the np.savez() function, which writes .npz files:

In [11]:
fname = 'arrays.npz'
np.savez(fname, a=a, b=np.random.normal(10), c=3.4)
In [12]:
arrays = np.load(fname)
arrays.files
Out[12]:
['a', 'b', 'c']
In [13]:
arrays['a']
Out[13]:
array([1. , 2. , 3.4])
In [14]:
arrays['c']
Out[14]:
array(3.4)
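If disk space matters, np.savez_compressed writes the same dict-like .npz container with compression, and the loading code is unchanged. A sketch with deliberately compressible data:

```python
import os
import numpy as np

a = np.zeros((1000, 1000))    # highly compressible
np.savez('plain.npz', a=a)
np.savez_compressed('small.npz', a=a)

print(os.path.getsize('plain.npz'), os.path.getsize('small.npz'))

b = np.load('small.npz')['a']
print((a == b).all())         # True: compression is lossless
```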

JSON¶

JSON stands for JavaScript Object Notation: a human-readable format that can represent native JavaScript types (it looks rather like a Python dict, and Jupyter notebooks are JSON files on disk). In some cases it can be an alternative to Pickle, with the advantage of being natively portable to the Web and the JavaScript ecosystem. From the Python pickle docs, this quick comparison is useful for our purposes:

  • JSON is a text serialization format (it outputs unicode text, although most of the time it is then encoded to utf-8), while pickle is a binary serialization format;

  • JSON is human-readable, while pickle is not;

  • JSON is interoperable and widely used outside of the Python ecosystem, while pickle is Python-specific;

  • JSON, by default, can only represent a subset of the Python built-in types, and no custom classes; pickle can represent an extremely large number of Python types (many of them automatically, by clever usage of Python’s introspection facilities; complex cases can be tackled by implementing specific object APIs);

  • Unlike pickle, deserializing untrusted JSON does not in itself create an arbitrary code execution vulnerability.

In [15]:
import json
a = ['foo', {'bar': ['baz', None, 1.0, 2]}]
json.dumps(a)
Out[15]:
'["foo", {"bar": ["baz", null, 1.0, 2]}]'
In [16]:
with open('test.json', 'w') as f:
    json.dump(a, f)
    
with open('test.json', 'r') as f:
    b = json.load(f)
    
b
Out[16]:
['foo', {'bar': ['baz', None, 1.0, 2]}]
In [17]:
a == b
Out[17]:
True

But be careful:

In [18]:
c = ['foo', {'bar': ('baz', None, 1.0, 2)}]

with open('test2.json', 'w') as f:
    json.dump(c, f)
    
with open('test2.json', 'r') as f:
    d = json.load(f)
    
c == d
Out[18]:
False
In [19]:
d
Out[19]:
['foo', {'bar': ['baz', None, 1.0, 2]}]
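Other non-JSON types, such as sets and complex numbers, don't even get silently coerced like tuples do; json.dumps raises TypeError. A default= hook lets you choose an encoding explicitly (the encoding chosen here is one convention, not a standard):

```python
import json

def encode_extra(obj):
    # json.dumps calls this only for objects it cannot serialize natively.
    if isinstance(obj, set):
        return sorted(obj)                       # sets -> sorted lists
    if isinstance(obj, complex):
        return {"re": obj.real, "im": obj.imag}  # complex -> dict
    raise TypeError(f"not JSON serializable: {type(obj)}")

json.dumps({"s": {3, 1, 2}, "z": 1j}, default=encode_extra)
# '{"s": [1, 2, 3], "z": {"re": 0.0, "im": 1.0}}'
```

As with tuples, this is lossy: the round trip gives you back a list and a dict, not a set and a complex number, unless you also decode with a matching object_hook.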
In [20]:
from IPython.display import JSON

JSON(c)
Out[20]:
<IPython.core.display.JSON object>

GeoJSON: the meaning of a schema.

In [21]:
classroom = {
    "type": "Feature",
    "geometry": {
        "type": "Point",
        "coordinates": [-122.25915, 37.87125]
    },
    "properties": {
    "name": "Wheeler Hall Auditorium"
  }
}

JSON(classroom)
Out[21]:
<IPython.core.display.JSON object>
In [22]:
from IPython.display import GeoJSON

GeoJSON(classroom)
<IPython.display.GeoJSON object>

Dataframes: CSV and Feather¶

Some useful performance comparisons regarding various ways of saving dataframes.

In [23]:
from pathlib import Path

import pandas as pd

df = pd.read_csv(Path.home()/"shared/climate-data/monthly_in_situ_co2_mlo_cleaned.csv")
df
Out[23]:
year month date_index fraction_date c02 data_adjusted_season data_fit data_adjusted_seasonally_fit data_filled data_adjusted_seasonally_filed
0 1958 1 21200 1958.0411 -99.99 -99.99 -99.99 -99.99 -99.99 -99.99
1 1958 2 21231 1958.1260 -99.99 -99.99 -99.99 -99.99 -99.99 -99.99
2 1958 3 21259 1958.2027 315.70 314.43 316.19 314.90 315.70 314.43
3 1958 4 21290 1958.2877 317.45 315.16 317.30 314.98 317.45 315.16
4 1958 5 21320 1958.3699 317.51 314.71 317.86 315.06 317.51 314.71
... ... ... ... ... ... ... ... ... ... ...
763 2021 8 44423 2021.6219 -99.99 -99.99 -99.99 -99.99 -99.99 -99.99
764 2021 9 44454 2021.7068 -99.99 -99.99 -99.99 -99.99 -99.99 -99.99
765 2021 10 44484 2021.7890 -99.99 -99.99 -99.99 -99.99 -99.99 -99.99
766 2021 11 44515 2021.8740 -99.99 -99.99 -99.99 -99.99 -99.99 -99.99
767 2021 12 44545 2021.9562 -99.99 -99.99 -99.99 -99.99 -99.99 -99.99

768 rows × 10 columns

In [24]:
%ls -l ~/shared/climate-data/monthly_in_situ_co2_mlo_cleaned.csv 
-rw-r--r-- 1 jovyan jovyan 50201 Nov  3 07:10 /home/jovyan/shared/climate-data/monthly_in_situ_co2_mlo_cleaned.csv
In [25]:
df.to_feather("co2.fth")
%ls -l co2*
-rw-r--r-- 1 jovyan jovyan 32218 Apr  6 17:52 co2.fth
In [26]:
df2 = pd.read_feather("co2.fth")
df2
Out[26]:
year month date_index fraction_date c02 data_adjusted_season data_fit data_adjusted_seasonally_fit data_filled data_adjusted_seasonally_filed
0 1958 1 21200 1958.0411 -99.99 -99.99 -99.99 -99.99 -99.99 -99.99
1 1958 2 21231 1958.1260 -99.99 -99.99 -99.99 -99.99 -99.99 -99.99
2 1958 3 21259 1958.2027 315.70 314.43 316.19 314.90 315.70 314.43
3 1958 4 21290 1958.2877 317.45 315.16 317.30 314.98 317.45 315.16
4 1958 5 21320 1958.3699 317.51 314.71 317.86 315.06 317.51 314.71
... ... ... ... ... ... ... ... ... ... ...
763 2021 8 44423 2021.6219 -99.99 -99.99 -99.99 -99.99 -99.99 -99.99
764 2021 9 44454 2021.7068 -99.99 -99.99 -99.99 -99.99 -99.99 -99.99
765 2021 10 44484 2021.7890 -99.99 -99.99 -99.99 -99.99 -99.99 -99.99
766 2021 11 44515 2021.8740 -99.99 -99.99 -99.99 -99.99 -99.99 -99.99
767 2021 12 44545 2021.9562 -99.99 -99.99 -99.99 -99.99 -99.99 -99.99

768 rows × 10 columns

HDF5, NetCDF, and Xarray¶

Here is a nice introduction to HDF5 from our NERSC friends, and this is a good intro tutorial with code examples. The docs for the h5py Python library have more technical details.

In brief (oversimplifying, but OK for our purposes):

  • HDF5 is a flexible binary file format that can store hierarchically nested data, with native support for multidimensional dense arrays of any numerical type. You can think of it as "a filesystem in a file", in that you can nest values in named "groups" (aka folders), and you can also store lots of metadata.
  • NetCDF is a data model. It specifies how the data should be structured.
  • Xarray is a Python library for numerical computing and data analysis that exposes the NetCDF data model as Python objects. Xarray objects have rich computational capabilities that, to first approximation, are a mix of the power of Numpy arrays and Pandas DataFrames.

Note:

When we say NetCDF, we will strictly mean NetCDF4. There's an older version 3 that wasn't based on HDF5, and that we will not discuss further.

Today, most NetCDF files you encounter use the HDF5 binary format for storage, but as of 2020 NetCDF data can also be stored in the Zarr format, which is better suited to cloud storage than HDF5, a format designed mainly with supercomputers in mind.

So, the picture is:

  • A NetCDF file stored using HDF5 is always a valid HDF5 file.
  • A NetCDF file can also be stored in Zarr format. You're more likely to encounter these when working in the cloud. For small files, it doesn't make much difference whether the format is h5 or zarr, but for larger data it does.
  • The HDF5 format can represent data that is not valid NetCDF: it supports a richer set of capabilities beyond NetCDF. Unless you have extremely specialized needs, I suggest you stick to the NetCDF model, which is already very rich and powerful.
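A minimal taste of the "filesystem in a file" idea with h5py (the file, group, and dataset names here are made up for illustration):

```python
import numpy as np
import h5py

with h5py.File("demo.h5", "w") as f:
    grp = f.create_group("experiment1")              # a "folder" inside the file
    dset = grp.create_dataset("temps", data=np.arange(10.0))
    dset.attrs["units"] = "K"                        # metadata attached to the data

with h5py.File("demo.h5", "r") as f:
    d = f["experiment1/temps"]                       # path-like access
    print(d[:3], d.attrs["units"])
```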

In [27]:
from pathlib import Path
import xarray as xr

DATA_DIR = Path.home()/Path('shared/climate-data')
ds = xr.open_dataset(DATA_DIR / "era5_monthly_2deg_aws_v20210920.nc")
ds
Out[27]:
<xarray.Dataset>
Dimensions:                                                                                   (
                                                                                               time: 504,
                                                                                               latitude: 90,
                                                                                               longitude: 180)
Coordinates:
  * time                                                                                      (time) datetime64[ns] ...
  * latitude                                                                                  (latitude) float32 ...
  * longitude                                                                                 (longitude) float32 ...
Data variables: (12/15)
    air_pressure_at_mean_sea_level                                                            (time, latitude, longitude) float32 ...
    air_temperature_at_2_metres                                                               (time, latitude, longitude) float32 ...
    air_temperature_at_2_metres_1hour_Maximum                                                 (time, latitude, longitude) float32 ...
    air_temperature_at_2_metres_1hour_Minimum                                                 (time, latitude, longitude) float32 ...
    dew_point_temperature_at_2_metres                                                         (time, latitude, longitude) float32 ...
    eastward_wind_at_100_metres                                                               (time, latitude, longitude) float32 ...
    ...                                                                                        ...
    northward_wind_at_100_metres                                                              (time, latitude, longitude) float32 ...
    northward_wind_at_10_metres                                                               (time, latitude, longitude) float32 ...
    precipitation_amount_1hour_Accumulation                                                   (time, latitude, longitude) float32 ...
    sea_surface_temperature                                                                   (time, latitude, longitude) float32 ...
    snow_density                                                                              (time, latitude, longitude) float32 ...
    surface_air_pressure                                                                      (time, latitude, longitude) float32 ...
Attributes:
    institution:  ECMWF
    source:       Reanalysis
    title:        ERA5 forecasts
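You don't need a large file to explore the NetCDF data model; a small synthetic Dataset shows the same ingredients as the ERA5 file above (dimensions, coordinates, data variables, and attributes) and can be written straight to NetCDF:

```python
import numpy as np
import xarray as xr

ds_small = xr.Dataset(
    data_vars={
        "temperature": (("time", "lat"), np.random.normal(15, 5, size=(3, 4))),
    },
    coords={
        "time": np.arange(3),
        "lat": np.linspace(-60.0, 60.0, 4),
    },
    attrs={"title": "synthetic example"},
)
ds_small.to_netcdf("small.nc")   # a valid NetCDF file (HDF5-based with the netCDF4 engine)
print(xr.open_dataset("small.nc"))
```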
In [28]:
%%time
file_aws = "https://mur-sst.s3.us-west-2.amazonaws.com/zarr-v1"
ds_sst = xr.open_zarr(file_aws, consolidated=True)
ds_sst
CPU times: user 1.28 s, sys: 115 ms, total: 1.4 s
Wall time: 3.01 s
Out[28]:
<xarray.Dataset>
Dimensions:           (time: 6443, lat: 17999, lon: 36000)
Coordinates:
  * lat               (lat) float32 -89.99 -89.98 -89.97 ... 89.97 89.98 89.99
  * lon               (lon) float32 -180.0 -180.0 -180.0 ... 180.0 180.0 180.0
  * time              (time) datetime64[ns] 2002-06-01T09:00:00 ... 2020-01-2...
Data variables:
    analysed_sst      (time, lat, lon) float32 dask.array<chunksize=(5, 1799, 3600), meta=np.ndarray>
    analysis_error    (time, lat, lon) float32 dask.array<chunksize=(5, 1799, 3600), meta=np.ndarray>
    mask              (time, lat, lon) float32 dask.array<chunksize=(5, 1799, 3600), meta=np.ndarray>
    sea_ice_fraction  (time, lat, lon) float32 dask.array<chunksize=(5, 1799, 3600), meta=np.ndarray>
Attributes: (12/47)
    Conventions:                CF-1.7
    Metadata_Conventions:       Unidata Observation Dataset v1.0
    acknowledgment:             Please acknowledge the use of these data with...
    cdm_data_type:              grid
    comment:                    MUR = "Multi-scale Ultra-high Resolution"
    creator_email:              ghrsst@podaac.jpl.nasa.gov
    ...                         ...
    summary:                    A merged, multi-sensor L4 Foundation SST anal...
    time_coverage_end:          20200116T210000Z
    time_coverage_start:        20200115T210000Z
    title:                      Daily MUR SST, Final product
    uuid:                       27665bc0-d5fc-11e1-9b23-0800200c9a66
    westernmost_longitude:      -180.0
xarray.Dataset
    • time: 6443
    • lat: 17999
    • lon: 36000
    • lat
      (lat)
      float32
      -89.99 -89.98 ... 89.98 89.99
      axis :
      Y
      comment :
      none
      long_name :
      latitude
      standard_name :
      latitude
      units :
      degrees_north
      valid_max :
      90.0
      valid_min :
      -90.0
      array([-89.99, -89.98, -89.97, ...,  89.97,  89.98,  89.99], dtype=float32)
    • lon
      (lon)
      float32
      -180.0 -180.0 ... 180.0 180.0
      axis :
      X
      comment :
      none
      long_name :
      longitude
      standard_name :
      longitude
      units :
      degrees_east
      valid_max :
      180.0
      valid_min :
      -180.0
      array([-179.99, -179.98, -179.97, ...,  179.98,  179.99,  180.  ],
            dtype=float32)
    • time
      (time)
      datetime64[ns]
      2002-06-01T09:00:00 ... 2020-01-...
      axis :
      T
      comment :
      Nominal time of analyzed fields
      long_name :
      reference time of sst field
      standard_name :
      time
      array(['2002-06-01T09:00:00.000000000', '2002-06-02T09:00:00.000000000',
             '2002-06-03T09:00:00.000000000', ..., '2020-01-18T09:00:00.000000000',
             '2020-01-19T09:00:00.000000000', '2020-01-20T09:00:00.000000000'],
            dtype='datetime64[ns]')
    • analysed_sst
      (time, lat, lon)
      float32
      dask.array<chunksize=(5, 1799, 3600), meta=np.ndarray>
      comment :
      "Final" version using Multi-Resolution Variational Analysis (MRVA) method for interpolation
      long_name :
      analysed sea surface temperature
      standard_name :
      sea_surface_foundation_temperature
      units :
      kelvin
      valid_max :
      32767
      valid_min :
      -32767
      Array Chunk
      Bytes 15.19 TiB 123.53 MiB
      Shape (6443, 17999, 36000) (5, 1799, 3600)
      Count 2 Graph Layers 141790 Chunks
      Type float32 numpy.ndarray
      36000 17999 6443
    • analysis_error
      (time, lat, lon)
      float32
      dask.array<chunksize=(5, 1799, 3600), meta=np.ndarray>
      comment :
      none
      long_name :
      estimated error standard deviation of analysed_sst
      units :
      kelvin
      valid_max :
      32767
      valid_min :
      0
      Array Chunk
      Bytes 15.19 TiB 123.53 MiB
      Shape (6443, 17999, 36000) (5, 1799, 3600)
      Count 2 Graph Layers 141790 Chunks
      Type float32 numpy.ndarray
      36000 17999 6443
    • mask
      (time, lat, lon)
      float32
      dask.array<chunksize=(5, 1799, 3600), meta=np.ndarray>
      comment :
      mask can be used to further filter the data.
      flag_masks :
      [1, 2, 4, 8, 16]
      flag_meanings :
      1=open-sea, 2=land, 5=open-lake, 9=open-sea with ice in the grid, 13=open-lake with ice in the grid
      flag_values :
      [1, 2, 5, 9, 13]
      long_name :
      sea/land field composite mask
      source :
      GMT "grdlandmask", ice flag from sea_ice_fraction data
      valid_max :
      31
      valid_min :
      1
      Array Chunk
      Bytes 15.19 TiB 123.53 MiB
      Shape (6443, 17999, 36000) (5, 1799, 3600)
      Count 2 Graph Layers 141790 Chunks
      Type float32 numpy.ndarray
      36000 17999 6443
    • sea_ice_fraction
      (time, lat, lon)
      float32
      dask.array<chunksize=(5, 1799, 3600), meta=np.ndarray>
      comment :
      ice data interpolated by a nearest neighbor approach.
      long_name :
      sea ice area fraction
      source :
      EUMETSAT OSI-SAF, copyright EUMETSAT
      standard_name :
      sea ice area fraction
      units :
      fraction (between 0 and 1)
      valid_max :
      100
      valid_min :
      0
      Array Chunk
      Bytes 15.19 TiB 123.53 MiB
      Shape (6443, 17999, 36000) (5, 1799, 3600)
      Count 2 Graph Layers 141790 Chunks
      Type float32 numpy.ndarray
      36000 17999 6443
  • Conventions :
    CF-1.7
    Metadata_Conventions :
    Unidata Observation Dataset v1.0
    acknowledgment :
    Please acknowledge the use of these data with the following statement: These data were provided by JPL under support by NASA MEaSUREs program.
    cdm_data_type :
    grid
    comment :
    MUR = "Multi-scale Ultra-high Resolution"
    creator_email :
    ghrsst@podaac.jpl.nasa.gov
    creator_name :
    JPL MUR SST project
    creator_url : http://mur.jpl.nasa.gov
    date_created : 20200124T010755Z
    easternmost_longitude : 180.0
    file_quality_level : 3
    gds_version_id : 2.0
    geospatial_lat_resolution : 0.009999999776482582
    geospatial_lat_units : degrees north
    geospatial_lon_resolution : 0.009999999776482582
    geospatial_lon_units : degrees east
    history : created at nominal 4-day latency; replaced nrt (1-day latency) version.
    id : MUR-JPL-L4-GLOB-v04.1
    institution : Jet Propulsion Laboratory
    keywords : Oceans > Ocean Temperature > Sea Surface Temperature
    keywords_vocabulary : NASA Global Change Master Directory (GCMD) Science Keywords
    license : These data are available free of charge under data policy of JPL PO.DAAC.
    metadata_link : http://podaac.jpl.nasa.gov/ws/metadata/dataset/?format=iso&shortName=MUR-JPL-L4-GLOB-v04.1
    naming_authority : org.ghrsst
    netcdf_version_id : 4.1
    northernmost_latitude : 90.0
    platform : Terra, Aqua, GCOM-W, MetOp-A, MetOp-B, Buoys/Ships
    processing_level : L4
    product_version : 04.1
    project : NASA Making Earth Science Data Records for Use in Research Environments (MEaSUREs) Program
    publisher_email : ghrsst-po@nceo.ac.uk
    publisher_name : GHRSST Project Office
    publisher_url : http://www.ghrsst.org
    references : http://podaac.jpl.nasa.gov/Multi-scale_Ultra-high_Resolution_MUR-SST
    sensor : MODIS, AMSR2, AVHRR, in-situ
    source : MODIS_T-JPL, MODIS_A-JPL, AMSR2-REMSS, AVHRRMTA_G-NAVO, AVHRRMTB_G-NAVO, iQUAM-NOAA/NESDIS, Ice_Conc-OSISAF
    southernmost_latitude : -90.0
    spatial_resolution : 0.01 degrees
    standard_name_vocabulary : NetCDF Climate and Forecast (CF) Metadata Convention
    start_time : 20200116T090000Z
    stop_time : 20200116T090000Z
    summary : A merged, multi-sensor L4 Foundation SST analysis product from JPL.
    time_coverage_end : 20200116T210000Z
    time_coverage_start : 20200115T210000Z
    title : Daily MUR SST, Final product
    uuid : 27665bc0-d5fc-11e1-9b23-0800200c9a66
    westernmost_longitude : -180.0
Warning:

The picture above is incomplete: `xr.open_dataset` reads only the root group of a NetCDF4 file by default, so any hierarchical groups the file contains do not appear in the repr.
In [29]:
ds = xr.open_dataset("data/test_hgroups.nc")
ds
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
/tmp/ipykernel_758/987355351.py in <module>
----> 1 ds = xr.open_dataset("data/test_hgroups.nc")
      2 ds

(... xarray/netCDF4 internal frames elided ...)

FileNotFoundError: [Errno 2] No such file or directory: b'/home/jovyan/sp23-dev/lec/lec22/data/test_hgroups.nc'
In [ ]:
import netCDF4 as nc

dsn = nc.Dataset("data/test_hgroups.nc")
dsn
In [ ]:
ds4 = xr.open_dataset("data/test_hgroups.nc",
                      group="mozaic_flight_2012030403540535_ascent")
ds4