diff --git a/README.md b/README.md index 8e4e70a..65b8022 100644 --- a/README.md +++ b/README.md @@ -14,6 +14,9 @@ operators for some of the contained resources are Kueue-aware, the AppWrapper operator ensures that when Kueue admits an AppWrapper for execution, all of the necessary information will be propagated to cause the child's Kueue-enabled operator to admit it as well. +For a more detailed description of the overall design, see the +[Architecture](https://project-codeflare.github.io/appwrapper/arch-controller/) +section of our website. AppWrappers are designed to harden workloads by providing an additional level of automatic fault detection and recovery. The AppWrapper @@ -23,225 +26,35 @@ the AppWrapper controller will orchestrate workload-level retries and resource deletion to ensure that either the workload returns to a healthy state or is cleanly removed from the cluster and its quota freed for use by other workloads. For details on customizing and -configuring these fault tolerance capabilities, please see -[fault_tolerance.md](docs/fault_tolerance.md). +configuring these fault tolerance capabilities, please see the +[Fault Tolerance](https://project-codeflare.github.io/appwrapper/arch-controller/) +section of our website. -## Description +## Installation -Kueue has a well-developed pattern for Kueue-enabling a Custom -Resource Definition and its associated operator. Following this pattern -allows the resulting operator to smoothly run alongside the core Kueue -operator. The pattern consists of three main elements: an Admission -Controller, a Workload Controller, and a Framework Controller. +To install the latest release of AppWrapper in a Kubernetes cluster with Kueue already installed +and configured, simply run the command: -#### Admission Controller - -Kueue requires the definition of an Admission Controller that ensures -that the `.spec.suspend` field of newly created AppWrapper instances is -set to true. 
We also leverage the Admission Controller to ensure that -the user creating the AppWrapper is also entitled to create the contained resources -and to validate AppWrapper-specific invariants. - -See [appwrapper_webhook.go](./internal/webhook/appwrapper_webhook.go) -for the implementation. - -#### Workload Controller - -An instantiation of Kueue’s GenericReconciller along with an -implementation of Kueue’s GenericJob interface for the AppWrapper -CRD. As is standard practice in Kueue, this controller will watch -AppWrapper instances and their owned Workload instances to reconcile -the two. This controller will make it possible for Kueue to suspend, -resume, and constrain the placement of the AppWrapper. It will report -the status of the AppWrapper to Kueue. - -See [workload_controller.go](./internal/controller/workload/workload_controller.go) -for the implementation. - -A small additional piece of logic is currently needed to generalize -Kueue's ability to recognize parent/children relationships and enforce -that admission by Kueue of the parent AppWrapper will be propagated to -its immediate children. - -See [child_admission_controller.go](./internal/controller/workload/child_admission_controller.go) -for the implementation. - -#### Framework Controller - -A standard reconciliation loop that watches AppWrapper instances and -is responsible for all AppWrapper-specific operations including -creating, monitoring, and deleting the wrapped resources in response -to the modifications of the AppWrapper instance’s specification and -status made by the Workload Controller described above. - -This [state transition diagram](docs/state-diagram.md) depicts the -lifecycle of an AppWrapper; the implementation is found in -[appwrapper_controller.go](./internal/controller/appwrapper/appwrapper_controller.go). - -## Getting Started - -### Prerequisites - -You'll need `go` v1.21.0+ installed on your development machine. 
- -You'll need a container runtime and cli (eg `docker` or `rancher-desktop`). - -You’ll need a Kubernetes cluster to run against. - -You can use [kind](https://sigs.k8s.io/kind) to get a local cluster -for testing, or run against a remote cluster. All commands shown in -this readme will automatically use the current context in your -kubeconfig file (i.e. whatever cluster `kubectl cluster-info` shows). - -For the purposes of simplifying the getting started documentation, we -proceed assuming you will create a local `kind` cluster. - -### Create your cluster and deploy Kueue - -Create the cluster with: ```sh -./hack/create-test-cluster.sh +kubectl apply --server-side -f https://github.com/project-codeflare/appwrapper/releases/download/v0.7.3/install.yaml ``` -Deploy Kueue on the cluster and configure it to have queues in your default namespace -with a nominal quota of 4 CPUs with: -```sh -./hack/deploy-kueue.sh -``` - -You can verify Kueue is configured as expected with: -```sh -% kubectl get localqueues,clusterqueues -o wide -NAME CLUSTERQUEUE PENDING WORKLOADS ADMITTED WORKLOADS -localqueue.kueue.x-k8s.io/user-queue cluster-queue 0 0 +The controller runs in the `appwrapper-system` namespace. -NAME COHORT STRATEGY PENDING WORKLOADS ADMITTED WORKLOADS -clusterqueue.kueue.x-k8s.io/cluster-queue BestEffortFIFO 0 0 -``` - -### Deploy on the cluster - -Build your image and push it to the cluster with: -```sh -make docker-build kind-push -``` +Read the [Quick Start Guide](https://project-codeflare.github.io/appwrapper/quick-start/) to learn more. -Deploy the CRDs and controller to the cluster: -```sh -make deploy -``` +## Usage -Within a few seconds, the controller pod in the `appwrapper-system` -namespace should be Ready. 
Verify this with: -```sh -kubectl get pods -n appwrapper-system -``` +For examples of AppWrapper usage, browse our [Samples](./samples) directory or +see the [Examples](https://project-codeflare.github.io/appwrapper/examples/) section +of the project website. -You can now try deploying a sample `AppWrapper`: -```sh -kubectl apply -f samples/appwrapper.yaml -``` +## Development -You should shortly see a Pod called `sample` running. -After running for 5 seconds, the Pod will complete and the -AppWrapper's status will be Succeeded. -```sh -% kubectl get appwrappers -NAME STATUS -sample Running -% kubectl get pods -NAME READY STATUS RESTARTS AGE -sample 1/1 Running 0 2s -% kubectl get pods -NAME READY STATUS RESTARTS AGE -sample 0/1 Completed 0 9s -% kubectl get appwrappers -NAME STATUS -sample Succeeded -``` - -You can now delete the sample AppWrapper. -```sh -kubectl delete -f samples/appwrapper.yaml -``` - -To undeploy the CRDs and controller from the cluster: -```sh -make undeploy -``` - -### Run the controller as a local process against the cluster - -For faster development and debugging, you can run the controller -directly on your development machine as local process that will -automatically be connected to the cluster. Note that in this -configuration, the webhooks that implement the Admission Controllers -are not operational. Therefore your CRDs will not be validated and -you must explictly set the `suspended` field to `true` in your -AppWrapper YAML files. - -Install the CRDs into the cluster: - -```sh -make install -``` - -Run your controller (this will run in the foreground, so switch to a new terminal if you want to leave it running): -```sh -make run -``` - -**NOTE:** You can also run this in one step by running: `make install run` - -You can now deploy a sample with `kubectl apply -f -samples/appwrapper.yaml` and observe its execution as described -above. 
- -After deleting all AppWrapper CR instances, you can uninstall the CRDs -with: -```sh -make uninstall -``` - -## Contributing - -### Pre-commit hooks - -This repository includes pre-configured pre-commit hooks. Make sure to install -the hooks immediately after cloning the repository: -```sh -pre-commit install -``` -See [https://pre-commit.com](https://pre-commit.com) for prerequisites. - -### Running unit tests - -Unit tests can be run at any time by doing `make test`. -No additional setup is required. - -### Running end-to-end tests - -A suite of end-to-end tests are run as part of the project's -[continuous intergration workflow](./.github/workflows/CI.yaml). -These tests can also be run locally aginst a deployed version of Kueue -and the AppWrapper controller. - -To create and initialize your cluster, perform the following steps: -```shell -./hack/create-test-cluster.sh -./hack/deploy-kueue.sh -``` - -Next build and deploy the AppWrapper operator -```shell -make docker-build kind-push -make deploy -``` - -Finally, run the test suite -```shell -./hack/run-tests-on-cluster.sh -``` +To contribute to the AppWrapper project and for detailed instructions on how to +build and deploy the project from source, see the +[Development Setup](https://project-codeflare.github.io/appwrapper/dev-setup/) section +of the project website. ## License diff --git a/docs/fault_tolerance.md b/docs/fault_tolerance.md deleted file mode 100644 index 4253b42..0000000 --- a/docs/fault_tolerance.md +++ /dev/null @@ -1,71 +0,0 @@ -## Fault Tolerance - -### Overall Design - -The `podSets` contained in the AppWrapper specification enable the AppWrapper -controller to inject labels into every `Pod` that is created by -the workload during its execution. Throughout the execution of the -workload, the AppWrapper controller monitors the number and health of -all labeled `Pods` and uses this information to determine if a -workload is unhealthy. 
A workload can be deemed *unhealthy* either -because it contains a non-zero number of `Failed` pods or because -after the `WarmupGracePeriod` has passed and it has fewer -`Running` and `Completed` pods than expected. - -If a workload is determined to be unhealthy, the AppWrapper controller -first waits for a `FailureGracePeriod` to allow the primary resource -controller an opportunity to react and return the workload to a -healthy state. If the `FailureGracePeriod` expires, the AppWrapper -controller will *reset* the workload by deleting its resources, waiting -for a `ResetPause`, and then creating new instances of the resources. -During this reset period, the AppWrapper **does not** release the workload's -quota; this ensures that when the resources are recreated they will still -have sufficient quota to execute. The number of times an AppWrapper is reset -is tracked as part of its status; if the number of resets exceeds the `RetryLimit`, -then the AppWrapper moves into a `Failed` state and its resources are deleted -(thus finally releasing its quota). If at any time during this retry loop, -an AppWrapper is suspended (ie, Kueue decides to preempt the AppWrapper), -the AppWrapper controller will respect this request by proceeding to delete -the resources - -When the AppWrapper controller decides to delete the resources for a workload, -it proceeds through several phases. First it does a normal delete of the -resources, allowing the primary resource controllers time to cascade the deletion -through all child resources. During a `DeletionGracePeriod`, the AppWrapper controller -monitors to see if the primary controllers have managed to successfully delete -all of the workload's Pods and resources. If they fail to accomplish this within -the `DeletionGracePeriod`, the AppWrapper controller then initiates a *forceful* -deletion of all remaining Pods and resources by deleting them with a `GracePeriod` of `0`. 
-An AppWrapper will continue to have its `ResourcesDeployed` condition to be -`True` until all resources and Pods are successfully deleted. - -This process ensures that when `ResourcesDeployed` becomes `False`, which -indicates to Kueue that the quota has been released, all resources created by -a failed workload will have been totally removed from the cluster. - -### Configuration Details - -The parameters of the retry loop described about are configured at the operator level -and can be customized on a per-AppWrapper basis by adding annotations. -The table below lists the parameters, gives their default, and the annotation that -can be used to customize them. - -| Parameter | Default Value | Annotation | -|---------------------|---------------|---------------------------------------------------------------| -| WarmupGracePeriod | 5 Minutes | workload.codeflare.dev.appwrapper/warmupGracePeriodDuration | -| FailureGracePeriod | 1 Minute | workload.codeflare.dev.appwrapper/failureGracePeriodDuration | -| ResetPause | 90 Seconds | workload.codeflare.dev.appwrapper/resetPauseDuration | -| RetryLimit | 3 | workload.codeflare.dev.appwrapper/retryLimit | -| DeletionGracePeriod | 10 Minutes | workload.codeflare.dev.appwrapper/deletionGracePeriodDuration | -| GracePeriodCeiling | 24 Hours | Not Applicable | - -The `GracePeriodCeiling` imposes an upper limit on the other grace periods to -reduce the impact of user-added annotations on overall system utilization. - -To support debugging `Failed` workloads, an additional annotation -`workload.codeflare.dev.appwrapper/debuggingFailureDeletionDelayDuration` can -be added to an AppWrapper when it is created to add a delay between the time the -AppWrapper enters the `Failed` state and when the process of deleting its resources -begins. 
Since the AppWrapper continues to consume quota during this delayed deletion period, -this annotation should be used sparingly and only when interactive debugging of -the failed workload is being actively pursued. diff --git a/docs/release_instructions.md b/docs/release_instructions.md index c1e6531..7f8429e 100644 --- a/docs/release_instructions.md +++ b/docs/release_instructions.md @@ -9,8 +9,7 @@ will: + generate the install.yaml for the release + create a [GitHub release](https://github.com/project-codeflare/appwrapper/releases) that contains the install.yaml -After the release process completes, update the -`appwrapper_version` and `kueue_version` variables in -[_config.yaml](../site/_config.yaml) and commit the changes to -update the installation instructions on the project web site to -refer to the latest released version. +After the automated release process completes, open a follow-up PR containing the +following updates to the main README and project website: + + Update the AppWrapper version number in the installation section of [README.md](../README.md#Installation). + + Update the `appwrapper_version` and `kueue_version` variables in [_config.yml](../site/_config.yml). diff --git a/docs/state-diagram.md b/docs/state-diagram.md deleted file mode 100644 index fbee1a6..0000000 --- a/docs/state-diagram.md +++ /dev/null @@ -1,60 +0,0 @@ -# AppWrapper State Diagram - -The state diagram below describes the transitions between the Phases of an AppWrapper. These states are augmented by two orthogonal conditions: - + QuotaReserved indicates whether the AppWrapper is considered Active by Kueue. - + ResourcesDeployed indicates whether wrapped resources may exist on the cluster. - -QuotaReserved and ResourcesDeployed are both true in states colored blue below. - -QuotaReserved and ResourcesDeployed will initially be true in the Failed state (pink), -but will become false when the controller succeeds at deleting the resources created -in the Resuming phase. 
- -ResourcesDeployed will be true in the Succeeded state (green), but QuotaReserved will be false. - -Any phase may transition to the Terminating phase (not shown) when the AppWrapper is deleted. -During the Terminating phase, QuotaReserved and ResourcesDeployed may initially be true -but will become false once the controller succeeds at deleting any associated resources. - -```mermaid -stateDiagram-v2 - e : Empty - - sd : Suspended - rs : Resuming - rn : Running - rt : Resetting - sg : Suspending - s : Succeeded - f : Failed - - %% Happy Path - e --> sd - sd --> rs : Suspend == false - rs --> rn - rn --> s - - %% Requeuing - rs --> sg : Suspend == true - rn --> sg : Suspend == true - rt --> sg : Suspend == true - sg --> sd - - %% Failures - rs --> f - rn --> f - rn --> rt : Workload Unhealthy - rt --> rs - - classDef quota fill:lightblue - class rs quota - class rn quota - class rt quota - class sg quota - - classDef failed fill:pink - class f failed - - classDef succeeded fill:lightgreen - class s succeeded -``` diff --git a/internal/controller/appwrapper/appwrapper_controller.go b/internal/controller/appwrapper/appwrapper_controller.go index 8369fcb..d7d4f4a 100644 --- a/internal/controller/appwrapper/appwrapper_controller.go +++ b/internal/controller/appwrapper/appwrapper_controller.go @@ -79,7 +79,7 @@ type podStatusSummary struct { // Reconcile reconciles an appwrapper // Please see [aw-states] for documentation of this method. 
// +// [aw-states]: https://project-codeflare.github.io/appwrapper/arch-controller/#framework-controller // //gocyclo:ignore func (r *AppWrapperReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) { diff --git a/site/README.md b/site/README.md new file mode 100644 index 0000000..5422ccc --- /dev/null +++ b/site/README.md @@ -0,0 +1,7 @@ +We use Jekyll to generate static HTML that can be served as a GitHub page for the project. + +The GitHub action [jekyll-gh-pages](.github/workflows/jekyll-gh-pages.yml) runs +whenever a change to the `_site` directory is pushed to the main branch. + +To host the website locally, you need a Ruby 3.1 environment. Then in this +directory do `bundle install` followed by `bundle exec jekyll serve`. diff --git a/site/_config.yml b/site/_config.yml index b2e08cd..87ac53a 100644 --- a/site/_config.yml +++ b/site/_config.yml @@ -32,6 +32,9 @@ kueue_version: v0.6.1 permalink: /:categories/:title/ timezone: America/New_York +exclude: +- README.md + include: - _pages diff --git a/site/_data/navigation.yml b/site/_data/navigation.yml index d569531..55df4aa 100644 --- a/site/_data/navigation.yml +++ b/site/_data/navigation.yml @@ -3,6 +3,8 @@ main: url: / - title: "Quick Start Guide" url: /quick-start/ +- title: "Samples" + url: /samples/ side: - title: "Installation" @@ -12,12 +14,12 @@ side: - title: "Development Setup" url: /dev-setup/ -- title: "Examples" +- title: "Samples" children: - title: "PyTorch Job" - url: "/sample-pytorch/" + url: "/samples/pytorch/" - title: "Batch Job" - url: "/sample-batch-job/" + url: "/samples/batch-job/" - title: "Architecture" children: diff --git a/site/_pages/dev-setup.md b/site/_pages/dev-setup.md index 1f64d93..c9bb482 100644 --- a/site/_pages/dev-setup.md +++ b/site/_pages/dev-setup.md @@ -13,15 +13,22 @@ You'll need a container runtime and cli (eg `docker` or 
`rancher-desktop`). You’ll need a Kubernetes cluster to run against. You can use [kind](https://sigs.k8s.io/kind) to get a local cluster -for testing, or run against a remote cluster. All commands shown in -this readme will automatically use the current context in your -kubeconfig file (i.e. whatever cluster `kubectl cluster-info` shows). +for testing, or run against a remote cluster. For the purposes of +simplifying the rest of these instructions, we proceed assuming you +will create a local `kind` cluster. -For the purposes of simplifying the getting started documentation, we -proceed assuming you will create a local `kind` cluster. +### Pre-commit hooks + +This repository includes pre-configured pre-commit hooks. Make sure to install +the hooks immediately after cloning the repository: +```sh +pre-commit install +``` +See [https://pre-commit.com](https://pre-commit.com) for prerequisites. ### Create your cluster and deploy Kueue + Create the cluster with: ```sh ./hack/create-test-cluster.sh @@ -127,27 +134,6 @@ with: make uninstall ``` - - - - - - - - - - - - -### Pre-commit hooks - -This repository includes pre-configured pre-commit hooks. Make sure to install -the hooks immediately after cloning the repository: -```sh -pre-commit install -``` -See [https://pre-commit.com](https://pre-commit.com) for prerequisites. - ### Running unit tests Unit tests can be run at any time by doing `make test`. 
diff --git a/site/_pages/sample-batch-job.md b/site/_pages/sample-batch-job.md index 7b0480c..c0fc44d 100644 --- a/site/_pages/sample-batch-job.md +++ b/site/_pages/sample-batch-job.md @@ -1,6 +1,6 @@ --- -permalink: /sample-batch-job/ -title: "AppWrapper Batch Job" +permalink: /samples/batch-job/ +title: "AppWrapper Containing a Batch Job" classes: wide --- diff --git a/site/_pages/sample-pytorch.md b/site/_pages/sample-pytorch.md index c9cd578..fef744d 100644 --- a/site/_pages/sample-pytorch.md +++ b/site/_pages/sample-pytorch.md @@ -1,5 +1,5 @@ --- -permalink: /sample-pytorch/ +permalink: /samples/pytorch/ title: "AppWrapper Containing PyTorch Job" classes: wide --- diff --git a/site/_pages/samples.md b/site/_pages/samples.md new file mode 100644 index 0000000..e5a175e --- /dev/null +++ b/site/_pages/samples.md @@ -0,0 +1,9 @@ +--- +permalink: /samples/ +title: "AppWrapper Samples" +classes: wide +--- + +Available AppWrapper Samples + + [Kubernetes Batch Job](/samples/batch-job) + + [PyTorch Job](/samples/pytorch)