Reformat weather datasets into zarr.
See the dataset integration guide for how to add a new dataset to be reformatted.
We use
- `uv` to manage dependencies and Python environments
- `ruff` for linting and formatting
- `mypy` for type checking
- `pytest` for testing
- `pre-commit` to automatically lint and format as you `git commit`
To get set up:

- Install `uv`
- Run `uv run pre-commit install` to set up the git hooks
- If you use VSCode, you may want to install the extensions (`ruff`, `mypy`) it will recommend when you open this folder
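A minimal sketch of the setup, assuming uv's standard install script (see the uv documentation for other install methods):

```bash
# Install uv (assumes the standard astral.sh install script)
curl -LsSf https://astral.sh/uv/install.sh | sh

# From the repo root, install the git hooks so linting and
# formatting run automatically on each commit
uv run pre-commit install
```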
Common commands:

- `uv run main --help`
- `uv run main <DATASET_ID> update-template`
- `uv run main <DATASET_ID> backfill-local <INIT_TIME_END>`
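For example, a local run might look like the sketch below. The dataset ID `example-dataset` is a placeholder and the timestamp format is an assumption; run `uv run main --help` to list the real dataset IDs and argument formats.

```bash
# Regenerate the dataset's zarr metadata template (hypothetical dataset ID)
uv run main example-dataset update-template

# Reformat data locally up to an (assumed ISO 8601) end init time
uv run main example-dataset backfill-local 2024-01-01T00:00
```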
- Add a dependency: `uv add <package> [--dev]`. Use `--dev` to add a development-only dependency.
- Lint: `uv run ruff check`
- Type check: `uv run mypy`
- Format: `uv run ruff format`
- Tests:
  - Run tests in parallel on all available cores: `uv run pytest`
  - Run tests serially: `uv run pytest -n 0`
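Standard pytest selection options also work through `uv run`; for example (the test file path and keyword below are placeholders):

```bash
# Run one test file serially (-n 0 disables pytest-xdist parallelism)
uv run pytest -n 0 tests/example_test.py

# Run only tests whose names match a keyword expression
uv run pytest -k "template"
```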
To reformat a large archive, we parallelize the work across multiple cloud servers.
We use
- `docker` to package the code and dependencies
- `kubernetes` indexed jobs to run work in parallel
To set up the cloud tooling:

- Install `docker` and `kubectl`. Make sure `docker` can be found at `/usr/bin/docker` and `kubectl` at `/usr/bin/kubectl`.
- Set up a docker image repository and export the `DOCKER_REPOSITORY` environment variable in your local shell, e.g. `export DOCKER_REPOSITORY=us-central1-docker.pkg.dev/<project-id>/reformatters/main`
- Set up a kubernetes cluster and configure `kubectl` to point to your cluster, e.g. `gcloud container clusters get-credentials <cluster-name> --region <region> --project <project>`
- Create a kubectl secret containing your Source Coop S3 credentials: `kubectl create secret generic source-coop-storage-options-key --from-literal=contents='{"key": "...", "secret": "..."}'`
- Run the backfill: `DYNAMICAL_ENV=prod uv run main <DATASET_ID> backfill-kubernetes <INIT_TIME_END> <JOBS_PER_POD> <MAX_PARALLELISM>`
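Putting the cloud steps together, a full kubernetes backfill might look like the following sketch. The project, cluster, and region names are placeholders, `example-dataset` is the same hypothetical dataset ID used above, and the jobs-per-pod (10) and max-parallelism (20) values are illustrative only:

```bash
# Point docker pushes at your image repository (placeholder values)
export DOCKER_REPOSITORY=us-central1-docker.pkg.dev/my-project/reformatters/main

# Point kubectl at your cluster
gcloud container clusters get-credentials my-cluster --region us-central1 --project my-project

# Store Source Coop S3 credentials in a kubernetes secret
kubectl create secret generic source-coop-storage-options-key \
  --from-literal=contents='{"key": "...", "secret": "..."}'

# Launch the parallel backfill against production storage
DYNAMICAL_ENV=prod uv run main example-dataset backfill-kubernetes 2024-01-01T00:00 10 20
```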