Native Python serialization
The following types can be pickled:

- None, True, and False;
- integers, floating-point numbers, complex numbers;
- strings, bytes, bytearrays;
- tuples, lists, sets, and dictionaries containing only picklable objects.
For our purposes, the list stops here. In reality, the following can also be pickled, but I do not recommend doing so for now. There are scenarios where pickling these things makes sense, but they are more advanced and we will not discuss them here:

- functions (built-in and user-defined) accessible from the top level of a module;
- classes accessible from the top level of a module;
- instances of such classes whose __dict__ or the result of calling __getstate__() is picklable.

Warning:
The pickle module is not secure! Only unpickle data you trust.
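To see why this warning matters, here is a minimal sketch (the class name and message are purely illustrative): pickle reconstructs objects by calling whatever __reduce__ tells it to, so a malicious payload can run arbitrary code the moment you unpickle it.

import pickle

class Innocent:
    # __reduce__ tells pickle how to rebuild this object;
    # a hostile payload can put *any* callable here.
    def __reduce__(self):
        return (print, ('this code ran during unpickling!',))

payload = pickle.dumps(Innocent())
pickle.loads(payload)  # prints the message -- imagine os.system instead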
import pickle
pickle.dumps(1234, protocol=0)
b'I1234\n.'
pickle.dumps([1])
b'\x80\x04\x95\x06\x00\x00\x00\x00\x00\x00\x00]\x94K\x01a.'
a = [1, 1.5, "hello", {3, 4}, {'int': 9, 'real': 9.0, 'complex': 9j}]
with open('data.pkl', 'wb') as f:
    pickle.dump(a, f)
with open('data.pkl', 'rb') as f:
    b = pickle.load(f)
b
[1, 1.5, 'hello', {3, 4}, {'int': 9, 'real': 9.0, 'complex': 9j}]
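The restriction to module-level functions mentioned above is real: pickle serializes functions by reference to an importable name, and a lambda has none. A quick illustrative check (the exact exception message varies by Python version):

try:
    pickle.dumps(lambda x: x + 1)
except Exception as err:
    # typically pickle.PicklingError: attribute lookup <lambda> ... failed
    print(type(err).__name__, err)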
The shelve module provides a disk-stored object that behaves (mostly) like a dict, whose keys are strings and whose values are anything that can be pickled.
import shelve
x = 1234
with shelve.open('spam') as db:
    db['eggs'] = 'eggs'
    db['numbers'] = [1, 2, 3, 9.99, 1j]
    db['xx'] = x
%ls -l spam.*
-rw-r--r-- 1 jovyan jovyan   53 Apr  6 17:52 spam.bak
-rw-r--r-- 1 jovyan jovyan 1030 Apr  6 17:52 spam.dat
-rw-r--r-- 1 jovyan jovyan   53 Apr  6 17:52 spam.dir
Note:
Do not rely on the shelf being closed automatically; always call close() explicitly when you don't need it any more, or use shelve.open() as a context manager, as shown above.
with shelve.open('spam') as db:
    e = db['eggs']
    n = db['numbers']
    print(f'{e = }')
    print(f'{n = }')
e = 'eggs'
n = [1, 2, 3, 9.99, 1j]
db = shelve.open('spam')
for var, data in db.items():
    print(f'{var} = {data}')
eggs = eggs
numbers = [1, 2, 3, 9.99, 1j]
xx = 1234
db.close()
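One classic shelve gotcha (documented in the standard library): mutating a value fetched from the shelf does not write the change back, because you are modifying a temporary unpickled copy. A sketch of the behavior and the fix:

with shelve.open('spam') as db:
    db['numbers'].append(4)   # mutates a temporary copy only
with shelve.open('spam') as db:
    print(db['numbers'])      # still [1, 2, 3, 9.99, 1j]

# Fix: reassign the value explicitly, or open with writeback=True,
# which caches values and writes them back on close (at a memory cost):
with shelve.open('spam', writeback=True) as db:
    db['numbers'].append(4)   # persisted when the shelf closes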
How to save and load NumPy objects provides more details, and all input/output APIs in NumPy are described in the I/O routines reference.
As a minimum starter, you should know that:

- Single arrays can be saved to disk as .npy files. These are portable across machines and versions of NumPy.
- Multiple arrays (or scalars) can be saved together in a single .npz file.

The relationship between npy and npz files is somewhat similar to that between single pickles and shelve objects in Python.
import numpy as np
a = np.array([1, 2, 3.4])
fname = 'arr.npy'
np.save(fname, a)
b = np.load(fname)
(a == b).all()
True
Multiple arrays (or scalar data) can be saved in a shelve-like object with the np.savez() function, which writes .npz files:
fname = 'arrays.npz'
np.savez(fname, a=a, b=np.random.normal(10), c=3.4)
arrays = np.load(fname)
arrays.files
['a', 'b', 'c']
arrays['a']
array([1. , 2. , 3.4])
arrays['c']
array(3.4)
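If your arrays are large and contain redundancy, np.savez_compressed() works exactly like np.savez() but zip-compresses each array. A small illustrative comparison (the file names are arbitrary):

big = np.zeros(100_000)
np.savez('big.npz', big=big)                       # uncompressed: ~800 kB of zeros
np.savez_compressed('big_compressed.npz', big=big) # much smaller for redundant data
%ls -l big*.npz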
JSON stands for JavaScript Object Notation. It is a human-readable format (it looks kind of like a Python dict; Jupyter notebooks are JSON files on disk) that can represent native JavaScript types. In some cases it can be an alternative to pickle, with the advantage of being natively portable to the Web and the JavaScript ecosystem. From the Python pickle docs, this quick comparison is useful for our purposes:
JSON is a text serialization format (it outputs unicode text, although most of the time it is then encoded to utf-8), while pickle is a binary serialization format;
JSON is human-readable, while pickle is not;
JSON is interoperable and widely used outside of the Python ecosystem, while pickle is Python-specific;
JSON, by default, can only represent a subset of the Python built-in types, and no custom classes; pickle can represent an extremely large number of Python types (many of them automatically, by clever usage of Python’s introspection facilities; complex cases can be tackled by implementing specific object APIs);
Unlike pickle, deserializing untrusted JSON does not in itself create an arbitrary code execution vulnerability.
import json
a = ['foo', {'bar': ['baz', None, 1.0, 2]}]
json.dumps(a)
'["foo", {"bar": ["baz", null, 1.0, 2]}]'
with open('test.json', 'w') as f:
    json.dump(a, f)
with open('test.json', 'r') as f:
    b = json.load(f)
b
['foo', {'bar': ['baz', None, 1.0, 2]}]
a == b
True
But be careful: JSON has no tuple type, so tuples are silently converted to lists on the way in:
c = ['foo', {'bar': ('baz', None, 1.0, 2)}]
with open('test2.json', 'w') as f:
    json.dump(c, f)
with open('test2.json', 'r') as f:
    d = json.load(f)
c == d
False
d
['foo', {'bar': ['baz', None, 1.0, 2]}]
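Also note that many Python types are not JSON-serializable at all: json.dumps(1j) raises a TypeError. The json module lets you supply a default hook that maps unknown types to something representable. A minimal sketch (the dict layout for complex numbers here is just a convention I made up):

def encode_complex(obj):
    # called only for objects json does not know how to serialize
    if isinstance(obj, complex):
        return {'__complex__': True, 're': obj.real, 'im': obj.imag}
    raise TypeError(f'not serializable: {obj!r}')

json.dumps({'z': 3 + 4j}, default=encode_complex)
# '{"z": {"__complex__": true, "re": 3.0, "im": 4.0}}'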
from IPython.display import JSON
JSON(c)
<IPython.core.display.JSON object>
GeoJSON: the meaning of a schema.
classroom = {
    "type": "Feature",
    "geometry": {
        "type": "Point",
        "coordinates": [-122.25915, 37.87125]
    },
    "properties": {
        "name": "Wheeler Hall Auditorium"
    }
}
JSON(classroom)
<IPython.core.display.JSON object>
from IPython.display import GeoJSON
GeoJSON(classroom)
<IPython.display.GeoJSON object>
Some useful performance comparisons regarding various ways of saving dataframes.
from pathlib import Path
import pandas as pd
df = pd.read_csv(Path.home()/"shared/climate-data/monthly_in_situ_co2_mlo_cleaned.csv")
df
| | year | month | date_index | fraction_date | c02 | data_adjusted_season | data_fit | data_adjusted_seasonally_fit | data_filled | data_adjusted_seasonally_filed |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1958 | 1 | 21200 | 1958.0411 | -99.99 | -99.99 | -99.99 | -99.99 | -99.99 | -99.99 |
| 1 | 1958 | 2 | 21231 | 1958.1260 | -99.99 | -99.99 | -99.99 | -99.99 | -99.99 | -99.99 |
| 2 | 1958 | 3 | 21259 | 1958.2027 | 315.70 | 314.43 | 316.19 | 314.90 | 315.70 | 314.43 |
| 3 | 1958 | 4 | 21290 | 1958.2877 | 317.45 | 315.16 | 317.30 | 314.98 | 317.45 | 315.16 |
| 4 | 1958 | 5 | 21320 | 1958.3699 | 317.51 | 314.71 | 317.86 | 315.06 | 317.51 | 314.71 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 763 | 2021 | 8 | 44423 | 2021.6219 | -99.99 | -99.99 | -99.99 | -99.99 | -99.99 | -99.99 |
| 764 | 2021 | 9 | 44454 | 2021.7068 | -99.99 | -99.99 | -99.99 | -99.99 | -99.99 | -99.99 |
| 765 | 2021 | 10 | 44484 | 2021.7890 | -99.99 | -99.99 | -99.99 | -99.99 | -99.99 | -99.99 |
| 766 | 2021 | 11 | 44515 | 2021.8740 | -99.99 | -99.99 | -99.99 | -99.99 | -99.99 | -99.99 |
| 767 | 2021 | 12 | 44545 | 2021.9562 | -99.99 | -99.99 | -99.99 | -99.99 | -99.99 | -99.99 |

768 rows × 10 columns
%ls -l ~/shared/climate-data/monthly_in_situ_co2_mlo_cleaned.csv
-rw-r--r-- 1 jovyan jovyan 50201 Nov 3 07:10 /home/jovyan/shared/climate-data/monthly_in_situ_co2_mlo_cleaned.csv
df.to_feather("co2.fth")
%ls -l co2*
-rw-r--r-- 1 jovyan jovyan 32218 Apr 6 17:52 co2.fth
df2 = pd.read_feather("co2.fth")
df2
| | year | month | date_index | fraction_date | c02 | data_adjusted_season | data_fit | data_adjusted_seasonally_fit | data_filled | data_adjusted_seasonally_filed |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1958 | 1 | 21200 | 1958.0411 | -99.99 | -99.99 | -99.99 | -99.99 | -99.99 | -99.99 |
| 1 | 1958 | 2 | 21231 | 1958.1260 | -99.99 | -99.99 | -99.99 | -99.99 | -99.99 | -99.99 |
| 2 | 1958 | 3 | 21259 | 1958.2027 | 315.70 | 314.43 | 316.19 | 314.90 | 315.70 | 314.43 |
| 3 | 1958 | 4 | 21290 | 1958.2877 | 317.45 | 315.16 | 317.30 | 314.98 | 317.45 | 315.16 |
| 4 | 1958 | 5 | 21320 | 1958.3699 | 317.51 | 314.71 | 317.86 | 315.06 | 317.51 | 314.71 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 763 | 2021 | 8 | 44423 | 2021.6219 | -99.99 | -99.99 | -99.99 | -99.99 | -99.99 | -99.99 |
| 764 | 2021 | 9 | 44454 | 2021.7068 | -99.99 | -99.99 | -99.99 | -99.99 | -99.99 | -99.99 |
| 765 | 2021 | 10 | 44484 | 2021.7890 | -99.99 | -99.99 | -99.99 | -99.99 | -99.99 | -99.99 |
| 766 | 2021 | 11 | 44515 | 2021.8740 | -99.99 | -99.99 | -99.99 | -99.99 | -99.99 | -99.99 |
| 767 | 2021 | 12 | 44545 | 2021.9562 | -99.99 | -99.99 | -99.99 | -99.99 | -99.99 | -99.99 |

768 rows × 10 columns
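Feather is not the only binary columnar format pandas supports: Parquet (via pyarrow or fastparquet, if installed) is a popular alternative that also compresses well. A quick sketch with an arbitrary file name:

df.to_parquet("co2.parquet")
df3 = pd.read_parquet("co2.parquet")
df3.equals(df)   # True if the round trip preserved the data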
Here is a nice introduction to HDF5 from our NERSC friends, and this is a good intro tutorial with code examples. The docs for the h5py Python library have more technical details.
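As a taste of the h5py API mentioned above, here is a minimal sketch (file and dataset names are arbitrary) showing the two core ideas: named datasets and folder-like groups.

import h5py
import numpy as np

# write: datasets can live at the root or inside nested groups
with h5py.File('demo.h5', 'w') as f:
    f.create_dataset('x', data=np.arange(10))
    grp = f.create_group('measurements')
    grp.create_dataset('temperature', data=np.random.normal(20, 1, 100))

# read: the file behaves like a nested dict of arrays
with h5py.File('demo.h5', 'r') as f:
    print(list(f.keys()))                    # ['measurements', 'x']
    print(f['measurements/temperature'][:5])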
In brief (oversimplifying, but OK for our purposes):

- HDF5 is a binary file format (and data model) for storing large, hierarchically organized numerical datasets together with their metadata.
- NetCDF is a data format and data model for array-oriented scientific data that, in its current version, is implemented on top of HDF5.
Note:
When we say NetCDF, we will strictly mean NetCDF4. There's an older version 3 that wasn't based on HDF5, and that we will not discuss further.
Today, most NetCDF files you encounter use the HDF5 binary format for storage, but as of 2020 NetCDF data can also be stored using the Zarr format, which is better suited for cloud storage than HDF5 (which was mostly designed for supercomputers).
So, the picture is: for small datasets it doesn't matter much whether the bytes live in h5 or zarr, but for larger data it does.

from pathlib import Path
import xarray as xr
DATA_DIR = Path.home()/Path('shared/climate-data')
ds = xr.open_dataset(DATA_DIR / "era5_monthly_2deg_aws_v20210920.nc")
ds
<xarray.Dataset>
Dimensions:                                    (time: 504, latitude: 90, longitude: 180)
Coordinates:
  * time                                       (time) datetime64[ns] ...
  * latitude                                   (latitude) float32 ...
  * longitude                                  (longitude) float32 ...
Data variables: (12/15)
    air_pressure_at_mean_sea_level             (time, latitude, longitude) float32 ...
    air_temperature_at_2_metres                (time, latitude, longitude) float32 ...
    air_temperature_at_2_metres_1hour_Maximum  (time, latitude, longitude) float32 ...
    air_temperature_at_2_metres_1hour_Minimum  (time, latitude, longitude) float32 ...
    dew_point_temperature_at_2_metres          (time, latitude, longitude) float32 ...
    eastward_wind_at_100_metres                (time, latitude, longitude) float32 ...
    ...                                         ...
    northward_wind_at_100_metres               (time, latitude, longitude) float32 ...
    northward_wind_at_10_metres                (time, latitude, longitude) float32 ...
    precipitation_amount_1hour_Accumulation    (time, latitude, longitude) float32 ...
    sea_surface_temperature                    (time, latitude, longitude) float32 ...
    snow_density                               (time, latitude, longitude) float32 ...
    surface_air_pressure                       (time, latitude, longitude) float32 ...
Attributes:
    institution:  ECMWF
    source:       Reanalysis
    title:        ERA5 forecasts
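Since xarray abstracts over the storage backend, the same Dataset can be written back out in Zarr format. An illustrative sketch using one variable to keep it cheap (the output path is arbitrary):

# write one variable (plus its coordinates) as a Zarr store -- a directory of chunks
ds[["sea_surface_temperature"]].to_zarr("era5_sst.zarr", mode="w")
sst_back = xr.open_zarr("era5_sst.zarr")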
%%time
file_aws = "https://mur-sst.s3.us-west-2.amazonaws.com/zarr-v1"
ds_sst = xr.open_zarr(file_aws, consolidated=True)
ds_sst
CPU times: user 1.28 s, sys: 115 ms, total: 1.4 s
Wall time: 3.01 s
<xarray.Dataset>
Dimensions:           (time: 6443, lat: 17999, lon: 36000)
Coordinates:
  * lat               (lat) float32 -89.99 -89.98 -89.97 ... 89.97 89.98 89.99
  * lon               (lon) float32 -180.0 -180.0 -180.0 ... 180.0 180.0 180.0
  * time              (time) datetime64[ns] 2002-06-01T09:00:00 ... 2020-01-2...
Data variables:
    analysed_sst      (time, lat, lon) float32 dask.array<chunksize=(5, 1799, 3600), meta=np.ndarray>
    analysis_error    (time, lat, lon) float32 dask.array<chunksize=(5, 1799, 3600), meta=np.ndarray>
    mask              (time, lat, lon) float32 dask.array<chunksize=(5, 1799, 3600), meta=np.ndarray>
    sea_ice_fraction  (time, lat, lon) float32 dask.array<chunksize=(5, 1799, 3600), meta=np.ndarray>
Attributes: (12/47)
    Conventions:            CF-1.7
    Metadata_Conventions:   Unidata Observation Dataset v1.0
    acknowledgment:         Please acknowledge the use of these data with...
    cdm_data_type:          grid
    comment:                MUR = "Multi-scale Ultra-high Resolution"
    creator_email:          ghrsst@podaac.jpl.nasa.gov
    ...                     ...
    summary:                A merged, multi-sensor L4 Foundation SST anal...
    time_coverage_end:      20200116T210000Z
    time_coverage_start:    20200115T210000Z
    title:                  Daily MUR SST, Final product
    uuid:                   27665bc0-d5fc-11e1-9b23-0800200c9a66
    westernmost_longitude:  -180.0
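Note the dask.array entries in the repr above: the data is loaded lazily, which is why opening this multi-terabyte dataset took only a few seconds. Chunks are downloaded only when a computation actually needs them. A sketch (the date and the lat/lon box are just an example):

sst_box = ds_sst['analysed_sst'].sel(
    time='2002-06-02',
    lat=slice(35, 36),
    lon=slice(-125, -124),
)
sst_box.mean().compute()   # triggers the download of just these chunks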
Warning:
The above picture is incomplete...
ds = xr.open_dataset("data/test_hgroups.nc")
ds
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
    [... long traceback through the xarray and netCDF4 backends elided ...]
FileNotFoundError: [Errno 2] No such file or directory: b'/home/jovyan/sp23-dev/lec/lec22/data/test_hgroups.nc'
import netCDF4 as nc
dsn = nc.Dataset("data/test_hgroups.nc")
dsn
ds4 = xr.open_dataset("data/test_hgroups.nc",
                      group="mozaic_flight_2012030403540535_ascent")
ds4
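If you don't know the group names in advance, they can be listed from the netCDF4 handle opened above (the names shown are just what this particular file happens to contain):

list(dsn.groups)   # e.g. ['mozaic_flight_2012030403540535_ascent', ...]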