Description
What happened?
Possibly related to #10827, the Dataset.to_zarr method silently produces a corrupt ZipStore when the input dataset has been chunked with dask arrays (even trivially). The write completes with nothing worse than UserWarnings, but the result cannot be read back with xr.open_zarr (via a second, read-only ZipStore). The file utility reports the zip as "empty," and zip -t reports bad offset values.
Note that this does not happen when the data is auto-chunked by zarr upon output, even if the auto-chunking is nontrivial.
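For reference, the reported symptoms can be checked from Python alone; this is a rough stdlib stand-in for the `file` and `zip -t` checks above (a hypothetical helper, not part of xarray or zarr):

```python
import zipfile

def diagnose_zip(path: str) -> str:
    """Rough stdlib equivalent of `file` + `zip -t` for a suspect archive."""
    if not zipfile.is_zipfile(path):
        # `file` tends to call such archives "empty" when no valid
        # end-of-central-directory record can be located.
        return "no valid end-of-central-directory record"
    try:
        with zipfile.ZipFile(path) as zf:
            # testzip() re-reads every member and returns the name of the
            # first one with a bad header or CRC, or None if all are OK.
            bad = zf.testzip()
            return f"first bad member: {bad}" if bad else "all members OK"
    except zipfile.BadZipFile as exc:
        return f"BadZipFile: {exc}"
```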
What did you expect to happen?
One expects the Zarr store written by Dataset.to_zarr to be valid and readable by xr.open_zarr.
Minimal Complete Verifiable Example
# /// script
# requires-python = ">=3.11"
# dependencies = [
# "xarray[complete]@git+https://github.com/pydata/xarray.git@main",
# ]
# ///
#
# This script automatically imports the development branch of xarray to check for issues.
# Please delete this header if you have _not_ tested this script with `uv run`!
import xarray as xr
xr.show_versions()
# your reproducer code ...
import numpy as np
import zarr
ds = xr.Dataset(data_vars={'foo': (('dim1',), np.zeros(10))})
ds = ds.chunk(dim1=-1)  # It works if you comment out this line
fname = 'foo.zip'
zs1 = zarr.storage.ZipStore(fname, mode='w', read_only=False)
out1 = ds.to_zarr(zs1, compute=True)
out1.close()
zs1.close()
zs2 = zarr.storage.ZipStore(fname, mode='r', read_only=True)
xr.open_zarr(zs2)
## ---------------------------------------------------------------------------
## BadZipFile Traceback (most recent call last)
## [...]
## BadZipFile: Bad magic number for file header
Steps to reproduce
No response
MVCE confirmation
- Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
- Complete example — the example is self-contained, including all data and the text of any traceback.
- Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
- New issue — a search of GitHub Issues suggests this is not a duplicate.
- Recent environment — the issue occurs with the latest version of xarray and its dependencies.
Relevant log output
/srv/conda/envs/notebook/lib/python3.11/zipfile.py:1567: UserWarning: Duplicate name: 'zarr.json'
return self._open_to_write(zinfo, force_zip64=force_zip64)
/srv/conda/envs/notebook/lib/python3.11/zipfile.py:1567: UserWarning: Duplicate name: 'foo/zarr.json'
return self._open_to_write(zinfo, force_zip64=force_zip64)
/srv/conda/envs/notebook/lib/python3.11/site-packages/zarr/api/asynchronous.py:244: ZarrUserWarning: Consolidated metadata is currently not part in the Zarr format 3 specification. It may not be supported by other zarr implementations and may change in the future.
warnings.warn(
---------------------------------------------------------------------------
BadZipFile Traceback (most recent call last)
Cell In[6], line 15
12 zs1.close()
14 zs2 = zarr.storage.ZipStore(fname,mode='r',read_only=True)
---> 15 xr.open_zarr(zs2)
File /srv/conda/envs/notebook/lib/python3.11/site-packages/xarray/backends/zarr.py:1586, in open_zarr(store, group, synchronizer, chunks, decode_cf, mask_and_scale, decode_times, concat_characters, decode_coords, drop_variables, consolidated, overwrite_encoded_chunks, chunk_store, storage_options, decode_timedelta, use_cftime, zarr_version, zarr_format, use_zarr_fill_value_as_mask, chunked_array_type, from_array_kwargs, create_default_indexes, **kwargs)
1572 raise TypeError(
1573 "open_zarr() got unexpected keyword arguments " + ",".join(kwargs.keys())
1574 )
1576 backend_kwargs = {
1577 "synchronizer": synchronizer,
1578 "consolidated": consolidated,
(...) 1583 "zarr_format": zarr_format,
1584 }
-> 1586 ds = open_dataset(
1587 filename_or_obj=store,
1588 group=group,
1589 decode_cf=decode_cf,
1590 mask_and_scale=mask_and_scale,
1591 decode_times=decode_times,
1592 concat_characters=concat_characters,
1593 decode_coords=decode_coords,
1594 engine="zarr",
1595 chunks=chunks,
1596 drop_variables=drop_variables,
1597 create_default_indexes=create_default_indexes,
1598 chunked_array_type=chunked_array_type,
1599 from_array_kwargs=from_array_kwargs,
1600 backend_kwargs=backend_kwargs,
1601 decode_timedelta=decode_timedelta,
1602 use_cftime=use_cftime,
1603 zarr_version=zarr_version,
1604 use_zarr_fill_value_as_mask=use_zarr_fill_value_as_mask,
1605 )
1606 return ds
File /srv/conda/envs/notebook/lib/python3.11/site-packages/xarray/backends/api.py:596, in open_dataset(filename_or_obj, engine, chunks, cache, decode_cf, mask_and_scale, decode_times, decode_timedelta, use_cftime, concat_characters, decode_coords, drop_variables, create_default_indexes, inline_array, chunked_array_type, from_array_kwargs, backend_kwargs, **kwargs)
584 decoders = _resolve_decoders_kwargs(
585 decode_cf,
586 open_backend_dataset_parameters=backend.open_dataset_parameters,
(...) 592 decode_coords=decode_coords,
593 )
595 overwrite_encoded_chunks = kwargs.pop("overwrite_encoded_chunks", None)
--> 596 backend_ds = backend.open_dataset(
597 filename_or_obj,
598 drop_variables=drop_variables,
599 **decoders,
600 **kwargs,
601 )
602 ds = _dataset_from_backend_dataset(
603 backend_ds,
604 filename_or_obj,
(...) 615 **kwargs,
616 )
617 return ds
File /srv/conda/envs/notebook/lib/python3.11/site-packages/xarray/backends/zarr.py:1660, in ZarrBackendEntrypoint.open_dataset(self, filename_or_obj, mask_and_scale, decode_times, concat_characters, decode_coords, drop_variables, use_cftime, decode_timedelta, group, mode, synchronizer, consolidated, chunk_store, storage_options, zarr_version, zarr_format, store, engine, use_zarr_fill_value_as_mask, cache_members)
1658 filename_or_obj = _normalize_path(filename_or_obj)
1659 if not store:
-> 1660 store = ZarrStore.open_group(
1661 filename_or_obj,
1662 group=group,
1663 mode=mode,
1664 synchronizer=synchronizer,
1665 consolidated=consolidated,
1666 consolidate_on_close=False,
1667 chunk_store=chunk_store,
1668 storage_options=storage_options,
1669 zarr_version=zarr_version,
1670 use_zarr_fill_value_as_mask=None,
1671 zarr_format=zarr_format,
1672 cache_members=cache_members,
1673 )
1675 store_entrypoint = StoreBackendEntrypoint()
1676 with close_on_error(store):
File /srv/conda/envs/notebook/lib/python3.11/site-packages/xarray/backends/zarr.py:714, in ZarrStore.open_group(cls, store, mode, synchronizer, group, consolidated, consolidate_on_close, chunk_store, storage_options, append_dim, write_region, safe_chunks, align_chunks, zarr_version, zarr_format, use_zarr_fill_value_as_mask, write_empty, cache_members)
688 @classmethod
689 def open_group(
690 cls,
(...) 707 cache_members: bool = True,
708 ):
709 (
710 zarr_group,
711 consolidate_on_close,
712 close_store_on_close,
713 use_zarr_fill_value_as_mask,
--> 714 ) = _get_open_params(
715 store=store,
716 mode=mode,
717 synchronizer=synchronizer,
718 group=group,
719 consolidated=consolidated,
720 consolidate_on_close=consolidate_on_close,
721 chunk_store=chunk_store,
722 storage_options=storage_options,
723 zarr_version=zarr_version,
724 use_zarr_fill_value_as_mask=use_zarr_fill_value_as_mask,
725 zarr_format=zarr_format,
726 )
728 return cls(
729 zarr_group,
730 mode,
(...) 739 cache_members=cache_members,
740 )
File /srv/conda/envs/notebook/lib/python3.11/site-packages/xarray/backends/zarr.py:1868, in _get_open_params(store, mode, synchronizer, group, consolidated, consolidate_on_close, chunk_store, storage_options, zarr_version, use_zarr_fill_value_as_mask, zarr_format)
1865 elif consolidated is None:
1866 # same but with more error handling in case no consolidated metadata found
1867 try:
-> 1868 zarr_root_group = zarr.open_consolidated(store, **open_kwargs)
1869 except (ValueError, KeyError):
1870 # ValueError in zarr-python 3.x, KeyError in 2.x.
1871 try:
File /srv/conda/envs/notebook/lib/python3.11/site-packages/zarr/api/synchronous.py:231, in open_consolidated(use_consolidated, *args, **kwargs)
226 def open_consolidated(*args: Any, use_consolidated: Literal[True] = True, **kwargs: Any) -> Group:
227 """
228 Alias for :func:`open_group` with ``use_consolidated=True``.
229 """
230 return Group(
--> 231 sync(async_api.open_consolidated(*args, use_consolidated=use_consolidated, **kwargs))
232 )
File /srv/conda/envs/notebook/lib/python3.11/site-packages/zarr/core/sync.py:163, in sync(coro, loop, timeout)
160 return_result = next(iter(finished)).result()
162 if isinstance(return_result, BaseException):
--> 163 raise return_result
164 else:
165 return return_result
File /srv/conda/envs/notebook/lib/python3.11/site-packages/zarr/core/sync.py:119, in _runner(coro)
114 """
115 Await a coroutine and return the result of running it. If awaiting the coroutine raises an
116 exception, the exception will be returned.
117 """
118 try:
--> 119 return await coro
120 except Exception as ex:
121 return ex
File /srv/conda/envs/notebook/lib/python3.11/site-packages/zarr/api/asynchronous.py:408, in open_consolidated(use_consolidated, *args, **kwargs)
403 if use_consolidated is not True:
404 raise TypeError(
405 "'use_consolidated' must be 'True' in 'open_consolidated'. Use 'open' with "
406 "'use_consolidated=False' to bypass consolidated metadata."
407 )
--> 408 return await open_group(*args, use_consolidated=use_consolidated, **kwargs)
File /srv/conda/envs/notebook/lib/python3.11/site-packages/zarr/api/asynchronous.py:857, in open_group(store, mode, cache_attrs, synchronizer, path, chunk_store, storage_options, zarr_version, zarr_format, meta_array, attributes, use_consolidated)
855 try:
856 if mode in _READ_MODES:
--> 857 return await AsyncGroup.open(
858 store_path, zarr_format=zarr_format, use_consolidated=use_consolidated
859 )
860 except (KeyError, FileNotFoundError):
861 pass
File /srv/conda/envs/notebook/lib/python3.11/site-packages/zarr/core/group.py:559, in AsyncGroup.open(cls, store, zarr_format, use_consolidated)
552 raise FileNotFoundError(store_path)
553 elif zarr_format is None:
554 (
555 zarr_json_bytes,
556 zgroup_bytes,
557 zattrs_bytes,
558 maybe_consolidated_metadata_bytes,
--> 559 ) = await asyncio.gather(
560 (store_path / ZARR_JSON).get(),
561 (store_path / ZGROUP_JSON).get(),
562 (store_path / ZATTRS_JSON).get(),
563 (store_path / str(consolidated_key)).get(),
564 )
565 if zarr_json_bytes is not None and zgroup_bytes is not None:
566 # warn and favor v3
567 msg = f"Both zarr.json (Zarr format 3) and .zgroup (Zarr format 2) metadata objects exist at {store_path}. Zarr format 3 will be used."
File /srv/conda/envs/notebook/lib/python3.11/site-packages/zarr/storage/_common.py:168, in StorePath.get(self, prototype, byte_range)
166 if prototype is None:
167 prototype = default_buffer_prototype()
--> 168 return await self.store.get(self.path, prototype=prototype, byte_range=byte_range)
File /srv/conda/envs/notebook/lib/python3.11/site-packages/zarr/storage/_zip.py:183, in ZipStore.get(self, key, prototype, byte_range)
180 assert isinstance(key, str)
182 with self._lock:
--> 183 return self._get(key, prototype=prototype, byte_range=byte_range)
File /srv/conda/envs/notebook/lib/python3.11/site-packages/zarr/storage/_zip.py:156, in ZipStore._get(self, key, prototype, byte_range)
154 # docstring inherited
155 try:
--> 156 with self._zf.open(key) as f: # will raise KeyError
157 if byte_range is None:
158 return prototype.buffer.from_bytes(f.read())
File /srv/conda/envs/notebook/lib/python3.11/zipfile.py:1585, in ZipFile.open(self, name, mode, pwd, force_zip64)
1583 fheader = struct.unpack(structFileHeader, fheader)
1584 if fheader[_FH_SIGNATURE] != stringFileHeader:
-> 1585 raise BadZipFile("Bad magic number for file header")
1587 fname = zef_file.read(fheader[_FH_FILENAME_LENGTH])
1588 if fheader[_FH_EXTRA_FIELD_LENGTH]:
BadZipFile: Bad magic number for file header
Anything else we need to know?
No response
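As background on the final exception (an editor's stdlib sketch, not from the reporter): zipfile locates each member via an offset stored in the archive's central directory, and raises BadZipFile("Bad magic number for file header") when the bytes at that offset do not start with the local-file-header signature PK\x03\x04; this is consistent with zip -t's complaint about bad offsets.

```python
import io
import zipfile

# Build a tiny, well-formed archive in memory.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("zarr.json", "{}")
data = buf.getvalue()

# Every stored member begins with the local-file-header signature.
assert data[:4] == b"PK\x03\x04"

# Corrupt that signature: the central directory at the end of the file is
# untouched, so opening the archive still succeeds, but reading the member
# fails exactly like the traceback above.
broken = io.BytesIO(b"XXXX" + data[4:])
with zipfile.ZipFile(broken) as zf:
    try:
        zf.read("zarr.json")
    except zipfile.BadZipFile as exc:
        print(exc)  # Bad magic number for file header
```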
Environment
INSTALLED VERSIONS
commit: None
python: 3.11.13 | packaged by conda-forge | (main, Jun 4 2025, 14:48:23) [GCC 13.3.0]
python-bits: 64
OS: Linux
OS-release: 6.8.0-63-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.14.6
libnetcdf: 4.9.3
xarray: 2025.10.1
pandas: 2.3.3
numpy: 2.3.3
scipy: 1.16.2
netCDF4: 1.7.2
pydap: 3.5.8
h5netcdf: 1.6.4
h5py: 3.14.0
zarr: 3.1.3
cftime: 1.6.4
nc_time_axis: 1.4.1
iris: 3.13.1
bottleneck: 1.6.0
dask: 2025.9.1
distributed: 2025.9.1
matplotlib: 3.10.6
cartopy: 0.25.0
seaborn: 0.13.2
numbagg: 0.9.3
fsspec: 2025.9.0
cupy: None
pint: 0.25
sparse: 0.17.0
flox: None
numpy_groupies: None
setuptools: 80.9.0
pip: 25.2
conda: None
pytest: None
mypy: None
IPython: 9.4.0
sphinx: None