
How easy can it be to create a zarr array #2083

@d-v-b


Users come to Zarr with a variety of array-like objects: numpy arrays, dask arrays, xarray DataArrays, zarr v2 arrays, zarr v3 arrays, etc. Imagine a Venn diagram of the attributes and methods of these objects: shape, __getitem__, and dtype would be in the shared middle, while chunks, chunksize, attrs, dims, codecs, and filters would be in the disjoint periphery. How can we conveniently model an arbitrary array-like object as a Zarr array? In particular, how can we ensure that you can create a complete Zarr array from an existing array-like object (which might already be a Zarr array) with a single function call?
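One way to picture the "shared middle" and "disjoint periphery" is as a structural type plus a list of optional attribute names. The following is only a sketch under assumptions: `ArrayLike`, `OPTIONAL_ATTRS`, `describe`, and `DemoArray` are hypothetical names invented for illustration, not part of zarr or any other library.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class ArrayLike(Protocol):
    """Assumed minimal interface: the shared middle of the Venn diagram."""

    @property
    def shape(self) -> tuple: ...

    @property
    def dtype(self) -> Any: ...

    def __getitem__(self, key: Any) -> Any: ...


# The disjoint periphery: each library defines only some of these names.
OPTIONAL_ATTRS = ("chunks", "chunksize", "attrs", "dims", "codecs", "filters")


def describe(obj: Any) -> dict:
    """Collect the required fields plus whichever optional ones exist."""
    if not isinstance(obj, ArrayLike):
        raise TypeError(f"{type(obj).__name__} lacks shape, dtype or __getitem__")
    found = {"shape": tuple(obj.shape), "dtype": str(obj.dtype)}
    for name in OPTIONAL_ATTRS:
        value = getattr(obj, name, None)
        if value is not None:
            found[name] = value
    return found


class DemoArray:
    """Hypothetical stand-in for an xarray-like object (not a real library type)."""

    def __init__(self):
        self.shape = (10,)
        self.dtype = "float64"
        self.attrs = {"foo": 10}
        self.dims = ("dim_0",)

    def __getitem__(self, key):
        return 0.0


print(describe(DemoArray()))
```

The runtime check only verifies that the three shared names are present; it says nothing about whether their values behave the way a Zarr array would need them to.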

If we agree on that objective, then here is a rough outline of what that function could look like:

  • We should have a top-level from_array function that creates a Zarr array from an existing array-like object.
import dask.array as da
import numpy as np
import xarray
import zarr
from numcodecs import GZip

# numpy
np_arr = np.zeros(10)
zarr.from_array(np_arr) # memorystore-backed zarr v3 array with shape (10,) and dtype float64, and default parameters for everything else
zarr.from_array(np_arr, zarr_format=2, compressor=GZip(), attributes={'foo': 10}) # same as above, but v2, with gzip, and attributes

# dask
da_arr = da.zeros((10,), chunks=(1,))
zarr.from_array(da_arr) # inherits the `chunks` attribute from the array
zarr.from_array(da_arr, chunking_bikeshed=(2,)) # overrides the chunks attribute, kwarg name tbd 🙃

# xarray
xr_arr = xarray.DataArray(np.zeros(10), attrs={'foo': 10}, dims=('dim_0',))
zarr.from_array(xr_arr) # zarr v3 array with dimension names inherited from xr_arr.dims, and attrs from xr_arr.attrs

# zarr
zarr.from_array(zarr.zeros(10)) # makes a copy of the array
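The examples above imply a precedence rule: an explicit keyword argument wins, otherwise the value is inherited from the source object, otherwise a default applies. A minimal sketch of that rule follows. It rests on several assumptions: `resolve_params`, `regular_chunks`, and `DaskLike` are hypothetical names, the fallback of one chunk spanning the whole array is a placeholder default rather than anything zarr does, and `chunks` stands in for the keyword whose name is still to be decided.

```python
from typing import Any, Optional


def regular_chunks(chunks: Any) -> tuple:
    """Normalize chunks to one int per axis.

    dask reports a tuple of block sizes per axis, e.g. ((1, 1, 1),), while a
    regular chunk grid needs one int per axis, so take the largest block.
    """
    return tuple(max(c) if isinstance(c, (tuple, list)) else int(c) for c in chunks)


def resolve_params(
    obj: Any,
    *,
    chunks: Optional[tuple] = None,
    attributes: Optional[dict] = None,
    dimension_names: Optional[tuple] = None,
) -> dict:
    """Explicit keyword arguments win; otherwise inherit from obj; else default."""
    shape = tuple(obj.shape)
    if chunks is None:
        inherited = getattr(obj, "chunks", None)
        # Assumed default: a single chunk covering the whole array.
        chunks = inherited if inherited is not None else shape
    if attributes is None:
        attributes = getattr(obj, "attrs", None) or {}
    if dimension_names is None:
        dimension_names = getattr(obj, "dims", None)
    return {
        "shape": shape,
        "dtype": str(obj.dtype),
        "chunks": regular_chunks(chunks),
        "attributes": dict(attributes),
        "dimension_names": dimension_names,
    }


class DaskLike:
    """Hypothetical stand-in that mimics the layout of dask's `chunks` attribute."""

    shape = (10,)
    dtype = "float64"
    chunks = ((1,) * 10,)


print(resolve_params(DaskLike()))               # chunks inherited from the source
print(resolve_params(DaskLike(), chunks=(2,)))  # chunks overridden by the caller
```

Taking the largest block per axis is only one possible way to map irregular dask chunks onto a regular grid; a real implementation might instead reject irregular chunking.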

Some open questions:

  • Should we copy data? Over in pydantic-zarr, I implemented a from_array function that only creates array metadata, because users might not want to eagerly move 10 TB of data at array definition time. Perhaps this could be controlled with a keyword argument.
  • Should we support creating v2 arrays through this API, or use a v2.from_array function for that? I'm fine either way.
  • How much work is required to implicitly model the different array-like libraries enough for the above functionality to be useful?
  • There is a similar question about zarr groups, but the set of "zarr-group-like objects" is a bit narrower than array-likes.
    Thoughts?
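To make the first open question concrete, here is a rough sketch of how a keyword argument could switch between metadata-only creation and an eager, chunk-by-chunk copy. Everything in it is an assumption made for illustration: the names `from_array_sketch`, `chunk_slices`, and `write_data` are hypothetical, the source is restricted to one dimension, and a plain dict stands in for a store.

```python
from typing import Any, Iterator


def chunk_slices(length: int, chunk_len: int) -> Iterator[slice]:
    """Yield one slice per chunk along a single axis."""
    for start in range(0, length, chunk_len):
        yield slice(start, min(start + chunk_len, length))


def from_array_sketch(source: Any, *, chunk_len: int, write_data: bool = True) -> dict:
    """Build metadata for a 1-D source and optionally copy its data.

    With write_data=False only the metadata is produced, so nothing is read
    from the source beyond its length.
    """
    length = len(source)
    result = {
        "metadata": {"shape": (length,), "chunks": (chunk_len,)},
        "chunks": {},
    }
    if write_data:
        # Copy one chunk at a time instead of materializing the whole source.
        for index, sl in enumerate(chunk_slices(length, chunk_len)):
            result["chunks"][index] = list(source[sl])
    return result


eager = from_array_sketch(range(10), chunk_len=4)
lazy = from_array_sketch(range(10), chunk_len=4, write_data=False)
print(eager["chunks"])
print(lazy["chunks"])
```

Both calls return the same metadata; only the eager one touches the source data, which is the distinction that matters when the source holds terabytes.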

    Labels

    enhancement (New features or improvements)
