Dataset.to_zarr with zarr.storage.ZipStore produces corrupt output with a dask-chunked dataset #10840

@csubich

What happened?

Possibly related to #10827: the Dataset.to_zarr method silently produces corrupt output when the input dataset is backed by dask arrays, even when the chunking is trivial (a single chunk). The write to the ZipStore completes with only UserWarnings, but the resulting file cannot be read back with xr.open_zarr (via a second, read-only ZipStore). The file utility also reports that the zip is "empty," and zip -t reports bad offset values.

Note that this does not happen when the data is auto-chunked by zarr upon output, even if the auto-chunking is nontrivial.
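The integrity checks mentioned above (file, zip -t) can be approximated from Python with the standard-library zipfile module. The sketch below is only an illustration: inspect_zip is a hypothetical helper name, not part of xarray or zarr, and the demo archive is built in memory rather than produced by Dataset.to_zarr. It lists member names that appear more than once and asks zipfile to verify every member.

```python
import io
import warnings
import zipfile
from collections import Counter


def inspect_zip(source):
    """Return (duplicate_names, first_bad_member) for a zip archive.

    `source` may be a path or a seekable file-like object.
    `first_bad_member` is None when every member passes zipfile's check.
    """
    with zipfile.ZipFile(source, mode="r") as zf:
        counts = Counter(zf.namelist())
        duplicates = sorted(name for name, n in counts.items() if n > 1)
        # testzip() reads every member and returns the first bad name, if any.
        first_bad = zf.testzip()
    return duplicates, first_bad


# Demo on an in-memory archive that contains the same key twice.
buf = io.BytesIO()
with warnings.catch_warnings():
    warnings.simplefilter("ignore", UserWarning)  # hide "Duplicate name" warnings
    with zipfile.ZipFile(buf, mode="w") as zf:
        zf.writestr("zarr.json", b"{}")
        zf.writestr("zarr.json", b'{"a": 1}')

duplicates, first_bad = inspect_zip(buf)
print(duplicates, first_bad)
```

Running inspect_zip("foo.zip") on the file written by the reproducer below may raise BadZipFile or report a bad member, depending on how the archive is damaged; that outcome is not verified here.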

What did you expect to happen?

One expects the Zarr store written by Dataset.to_zarr to be valid and readable by xr.open_zarr.

Minimal Complete Verifiable Example

# /// script
# requires-python = ">=3.11"
# dependencies = [
#   "xarray[complete]@git+https://github.com/pydata/xarray.git@main",
# ]
# ///
#
# This script automatically imports the development branch of xarray to check for issues.
# Please delete this header if you have _not_ tested this script with `uv run`!

import xarray as xr
xr.show_versions()
# your reproducer code ...

import numpy as np
import zarr

ds = xr.Dataset(data_vars={"foo": (("dim1",), np.zeros(10))})
ds = ds.chunk(dim1=-1)  # It works if you comment out this line

fname = "foo.zip"
zs1 = zarr.storage.ZipStore(fname, mode="w", read_only=False)
out1 = ds.to_zarr(zs1, compute=True)
out1.close()
zs1.close()

zs2 = zarr.storage.ZipStore(fname, mode="r", read_only=True)
xr.open_zarr(zs2)

## ---------------------------------------------------------------------------
## BadZipFile                                Traceback (most recent call last)
## [...]
## BadZipFile: Bad magic number for file header

Steps to reproduce

No response

MVCE confirmation

  • Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • Complete example — the example is self-contained, including all data and the text of any traceback.
  • Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • New issue — a search of GitHub Issues suggests this is not a duplicate.
  • Recent environment — the issue occurs with the latest version of xarray and its dependencies.

Relevant log output

/srv/conda/envs/notebook/lib/python3.11/zipfile.py:1567: UserWarning: Duplicate name: 'zarr.json'
  return self._open_to_write(zinfo, force_zip64=force_zip64)
/srv/conda/envs/notebook/lib/python3.11/zipfile.py:1567: UserWarning: Duplicate name: 'foo/zarr.json'
  return self._open_to_write(zinfo, force_zip64=force_zip64)
/srv/conda/envs/notebook/lib/python3.11/site-packages/zarr/api/asynchronous.py:244: ZarrUserWarning: Consolidated metadata is currently not part in the Zarr format 3 specification. It may not be supported by other zarr implementations and may change in the future.
  warnings.warn(
---------------------------------------------------------------------------
BadZipFile                                Traceback (most recent call last)
Cell In[6], line 15
     12 zs1.close()
     14 zs2 = zarr.storage.ZipStore(fname,mode='r',read_only=True)
---> 15 xr.open_zarr(zs2)

File /srv/conda/envs/notebook/lib/python3.11/site-packages/xarray/backends/zarr.py:1586, in open_zarr(store, group, synchronizer, chunks, decode_cf, mask_and_scale, decode_times, concat_characters, decode_coords, drop_variables, consolidated, overwrite_encoded_chunks, chunk_store, storage_options, decode_timedelta, use_cftime, zarr_version, zarr_format, use_zarr_fill_value_as_mask, chunked_array_type, from_array_kwargs, create_default_indexes, **kwargs)
   1572     raise TypeError(
   1573         "open_zarr() got unexpected keyword arguments " + ",".join(kwargs.keys())
   1574     )
   1576 backend_kwargs = {
   1577     "synchronizer": synchronizer,
   1578     "consolidated": consolidated,
   (...)   1583     "zarr_format": zarr_format,
   1584 }
-> 1586 ds = open_dataset(
   1587     filename_or_obj=store,
   1588     group=group,
   1589     decode_cf=decode_cf,
   1590     mask_and_scale=mask_and_scale,
   1591     decode_times=decode_times,
   1592     concat_characters=concat_characters,
   1593     decode_coords=decode_coords,
   1594     engine="zarr",
   1595     chunks=chunks,
   1596     drop_variables=drop_variables,
   1597     create_default_indexes=create_default_indexes,
   1598     chunked_array_type=chunked_array_type,
   1599     from_array_kwargs=from_array_kwargs,
   1600     backend_kwargs=backend_kwargs,
   1601     decode_timedelta=decode_timedelta,
   1602     use_cftime=use_cftime,
   1603     zarr_version=zarr_version,
   1604     use_zarr_fill_value_as_mask=use_zarr_fill_value_as_mask,
   1605 )
   1606 return ds

File /srv/conda/envs/notebook/lib/python3.11/site-packages/xarray/backends/api.py:596, in open_dataset(filename_or_obj, engine, chunks, cache, decode_cf, mask_and_scale, decode_times, decode_timedelta, use_cftime, concat_characters, decode_coords, drop_variables, create_default_indexes, inline_array, chunked_array_type, from_array_kwargs, backend_kwargs, **kwargs)
    584 decoders = _resolve_decoders_kwargs(
    585     decode_cf,
    586     open_backend_dataset_parameters=backend.open_dataset_parameters,
   (...)    592     decode_coords=decode_coords,
    593 )
    595 overwrite_encoded_chunks = kwargs.pop("overwrite_encoded_chunks", None)
--> 596 backend_ds = backend.open_dataset(
    597     filename_or_obj,
    598     drop_variables=drop_variables,
    599     **decoders,
    600     **kwargs,
    601 )
    602 ds = _dataset_from_backend_dataset(
    603     backend_ds,
    604     filename_or_obj,
   (...)    615     **kwargs,
    616 )
    617 return ds

File /srv/conda/envs/notebook/lib/python3.11/site-packages/xarray/backends/zarr.py:1660, in ZarrBackendEntrypoint.open_dataset(self, filename_or_obj, mask_and_scale, decode_times, concat_characters, decode_coords, drop_variables, use_cftime, decode_timedelta, group, mode, synchronizer, consolidated, chunk_store, storage_options, zarr_version, zarr_format, store, engine, use_zarr_fill_value_as_mask, cache_members)
   1658 filename_or_obj = _normalize_path(filename_or_obj)
   1659 if not store:
-> 1660     store = ZarrStore.open_group(
   1661         filename_or_obj,
   1662         group=group,
   1663         mode=mode,
   1664         synchronizer=synchronizer,
   1665         consolidated=consolidated,
   1666         consolidate_on_close=False,
   1667         chunk_store=chunk_store,
   1668         storage_options=storage_options,
   1669         zarr_version=zarr_version,
   1670         use_zarr_fill_value_as_mask=None,
   1671         zarr_format=zarr_format,
   1672         cache_members=cache_members,
   1673     )
   1675 store_entrypoint = StoreBackendEntrypoint()
   1676 with close_on_error(store):

File /srv/conda/envs/notebook/lib/python3.11/site-packages/xarray/backends/zarr.py:714, in ZarrStore.open_group(cls, store, mode, synchronizer, group, consolidated, consolidate_on_close, chunk_store, storage_options, append_dim, write_region, safe_chunks, align_chunks, zarr_version, zarr_format, use_zarr_fill_value_as_mask, write_empty, cache_members)
    688 @classmethod
    689 def open_group(
    690     cls,
   (...)    707     cache_members: bool = True,
    708 ):
    709     (
    710         zarr_group,
    711         consolidate_on_close,
    712         close_store_on_close,
    713         use_zarr_fill_value_as_mask,
--> 714     ) = _get_open_params(
    715         store=store,
    716         mode=mode,
    717         synchronizer=synchronizer,
    718         group=group,
    719         consolidated=consolidated,
    720         consolidate_on_close=consolidate_on_close,
    721         chunk_store=chunk_store,
    722         storage_options=storage_options,
    723         zarr_version=zarr_version,
    724         use_zarr_fill_value_as_mask=use_zarr_fill_value_as_mask,
    725         zarr_format=zarr_format,
    726     )
    728     return cls(
    729         zarr_group,
    730         mode,
   (...)    739         cache_members=cache_members,
    740     )

File /srv/conda/envs/notebook/lib/python3.11/site-packages/xarray/backends/zarr.py:1868, in _get_open_params(store, mode, synchronizer, group, consolidated, consolidate_on_close, chunk_store, storage_options, zarr_version, use_zarr_fill_value_as_mask, zarr_format)
   1865 elif consolidated is None:
   1866     # same but with more error handling in case no consolidated metadata found
   1867     try:
-> 1868         zarr_root_group = zarr.open_consolidated(store, **open_kwargs)
   1869     except (ValueError, KeyError):
   1870         # ValueError in zarr-python 3.x, KeyError in 2.x.
   1871         try:

File /srv/conda/envs/notebook/lib/python3.11/site-packages/zarr/api/synchronous.py:231, in open_consolidated(use_consolidated, *args, **kwargs)
    226 def open_consolidated(*args: Any, use_consolidated: Literal[True] = True, **kwargs: Any) -> Group:
    227     """
    228     Alias for :func:`open_group` with ``use_consolidated=True``.
    229     """
    230     return Group(
--> 231         sync(async_api.open_consolidated(*args, use_consolidated=use_consolidated, **kwargs))
    232     )

File /srv/conda/envs/notebook/lib/python3.11/site-packages/zarr/core/sync.py:163, in sync(coro, loop, timeout)
    160 return_result = next(iter(finished)).result()
    162 if isinstance(return_result, BaseException):
--> 163     raise return_result
    164 else:
    165     return return_result

File /srv/conda/envs/notebook/lib/python3.11/site-packages/zarr/core/sync.py:119, in _runner(coro)
    114 """
    115 Await a coroutine and return the result of running it. If awaiting the coroutine raises an
    116 exception, the exception will be returned.
    117 """
    118 try:
--> 119     return await coro
    120 except Exception as ex:
    121     return ex

File /srv/conda/envs/notebook/lib/python3.11/site-packages/zarr/api/asynchronous.py:408, in open_consolidated(use_consolidated, *args, **kwargs)
    403 if use_consolidated is not True:
    404     raise TypeError(
    405         "'use_consolidated' must be 'True' in 'open_consolidated'. Use 'open' with "
    406         "'use_consolidated=False' to bypass consolidated metadata."
    407     )
--> 408 return await open_group(*args, use_consolidated=use_consolidated, **kwargs)

File /srv/conda/envs/notebook/lib/python3.11/site-packages/zarr/api/asynchronous.py:857, in open_group(store, mode, cache_attrs, synchronizer, path, chunk_store, storage_options, zarr_version, zarr_format, meta_array, attributes, use_consolidated)
    855 try:
    856     if mode in _READ_MODES:
--> 857         return await AsyncGroup.open(
    858             store_path, zarr_format=zarr_format, use_consolidated=use_consolidated
    859         )
    860 except (KeyError, FileNotFoundError):
    861     pass

File /srv/conda/envs/notebook/lib/python3.11/site-packages/zarr/core/group.py:559, in AsyncGroup.open(cls, store, zarr_format, use_consolidated)
    552         raise FileNotFoundError(store_path)
    553 elif zarr_format is None:
    554     (
    555         zarr_json_bytes,
    556         zgroup_bytes,
    557         zattrs_bytes,
    558         maybe_consolidated_metadata_bytes,
--> 559     ) = await asyncio.gather(
    560         (store_path / ZARR_JSON).get(),
    561         (store_path / ZGROUP_JSON).get(),
    562         (store_path / ZATTRS_JSON).get(),
    563         (store_path / str(consolidated_key)).get(),
    564     )
    565     if zarr_json_bytes is not None and zgroup_bytes is not None:
    566         # warn and favor v3
    567         msg = f"Both zarr.json (Zarr format 3) and .zgroup (Zarr format 2) metadata objects exist at {store_path}. Zarr format 3 will be used."

File /srv/conda/envs/notebook/lib/python3.11/site-packages/zarr/storage/_common.py:168, in StorePath.get(self, prototype, byte_range)
    166 if prototype is None:
    167     prototype = default_buffer_prototype()
--> 168 return await self.store.get(self.path, prototype=prototype, byte_range=byte_range)

File /srv/conda/envs/notebook/lib/python3.11/site-packages/zarr/storage/_zip.py:183, in ZipStore.get(self, key, prototype, byte_range)
    180 assert isinstance(key, str)
    182 with self._lock:
--> 183     return self._get(key, prototype=prototype, byte_range=byte_range)

File /srv/conda/envs/notebook/lib/python3.11/site-packages/zarr/storage/_zip.py:156, in ZipStore._get(self, key, prototype, byte_range)
    154 # docstring inherited
    155 try:
--> 156     with self._zf.open(key) as f:  # will raise KeyError
    157         if byte_range is None:
    158             return prototype.buffer.from_bytes(f.read())

File /srv/conda/envs/notebook/lib/python3.11/zipfile.py:1585, in ZipFile.open(self, name, mode, pwd, force_zip64)
   1583 fheader = struct.unpack(structFileHeader, fheader)
   1584 if fheader[_FH_SIGNATURE] != stringFileHeader:
-> 1585     raise BadZipFile("Bad magic number for file header")
   1587 fname = zef_file.read(fheader[_FH_FILENAME_LENGTH])
   1588 if fheader[_FH_EXTRA_FIELD_LENGTH]:

BadZipFile: Bad magic number for file header
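
For context on the "Duplicate name" warnings in the log above: Python's zipfile cannot overwrite a member in place, so writing the same key twice appends a second entry and emits a UserWarning. The sketch below reproduces only that standard-library behavior on an in-memory archive. It does not involve xarray, zarr, or dask, and it does not establish the cause of the corruption reported here.

```python
import io
import warnings
import zipfile

buf = io.BytesIO()

# Record warnings instead of printing them, so they can be inspected afterwards.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    with zipfile.ZipFile(buf, mode="w") as zf:
        zf.writestr("zarr.json", b'{"version": 1}')
        zf.writestr("zarr.json", b'{"version": 2}')  # same key written again

messages = [str(w.message) for w in caught]

# Reopen the archive: both entries are present, and reading by name
# returns the most recently written one.
with zipfile.ZipFile(buf) as zf:
    names = zf.namelist()
    latest = zf.read("zarr.json")

print(messages)
print(names, latest)
```

In this isolated case the archive stays readable despite the duplicate entry, which suggests the warnings alone do not explain the BadZipFile error.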

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS

commit: None
python: 3.11.13 | packaged by conda-forge | (main, Jun 4 2025, 14:48:23) [GCC 13.3.0]
python-bits: 64
OS: Linux
OS-release: 6.8.0-63-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.14.6
libnetcdf: 4.9.3

xarray: 2025.10.1
pandas: 2.3.3
numpy: 2.3.3
scipy: 1.16.2
netCDF4: 1.7.2
pydap: 3.5.8
h5netcdf: 1.6.4
h5py: 3.14.0
zarr: 3.1.3
cftime: 1.6.4
nc_time_axis: 1.4.1
iris: 3.13.1
bottleneck: 1.6.0
dask: 2025.9.1
distributed: 2025.9.1
matplotlib: 3.10.6
cartopy: 0.25.0
seaborn: 0.13.2
numbagg: 0.9.3
fsspec: 2025.9.0
cupy: None
pint: 0.25
sparse: 0.17.0
flox: None
numpy_groupies: None
setuptools: 80.9.0
pip: 25.2
conda: None
pytest: None
mypy: None
IPython: 9.4.0
sphinx: None

Metadata

Labels: bug, needs triage
