-
Notifications
You must be signed in to change notification settings - Fork 2k
[Fixed] Run CI on a single machine by generating cloudbuild files #4186
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Fixed] Run CI on a single machine by generating cloudbuild files #4186
Conversation
Instead of running each CI test on a separate Google Cloud machine, generate a cloudbuild.yml file to test all affected packages on a single machine. The benifits of this strategy include the following: * Fewer machines are required for a single test (2 instead of up to 12), so queuing should happen much less often. * Packages only need to be built once, instead of once per machine. * Multithreaded steps have access to more cores. When CI is run, the root cloudbuild.yml is launched on Google Cloud. This file has only three steps: Install yarn dependencies, run `yarn generate-cloudbuild-for-packages`, and run `run-build.sh`. The second step finds what packages were affected by the change and combines their cloudbuild.yml files into a single `cloudbuild_generated.yml` file, and the third step launches the generated build on Google Cloud. The steps in the generated build are set to `waitFor` each other according to the dependency tree `scripts/package_dependencies.json`, and each package's `build-deps` step is removed in favor of waiting for its dependencies to be built. Each package's cloudbuild file still works as expected, and there are two new yarn commands at the root: * `yarn test-packages` tests all packages affected by a change. It performs the same steps as CI. * `yarn generate-cloudbuild-for-packages [PACKAGE]...` generates the cloudbuild file to test each `PACKAGE` and packages it might affect. If given no packages, it detects what packages have been changed in the current PR.
tfjs-node builds N-API bindings that end-to-end tests need to use. End-to-end uses our custom docker image (gcr.io/learnjs-174218/release) since it has to build tfjs-backend-wasm when it's run by itself (which requires emscripten). Our custom image uses node 12, and is incompatible with node 10 N-API bindings. Eventually, we'll want to move everything that's using node:10 to node:12, but that's out of the scope of this commit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work!!! Very clean and thorough. Thank you Matt!
Reviewable status:
complete! 1 of 1 approvals obtained (waiting on @lina128, @mattsoulanille, @pyu10055, and @tafsiri)
scripts/generate_cloudbuild.js, line 103 at r1 (raw file):
// If there are no such nodes, then the graph has a cycle. if (emptyNodes.length === 0) { // TODO(msoulanille): Include the cycle in the error message?
Seems this TODO is implemented below, you can remove the TODO.
scripts/find_packages_with_diff.js, line 25 at r1 (raw file):
const filesAllowlistToTriggerBuild = [ 'cloudbuild.yml', 'package.json', 'tsconfig.json', 'tslint.json', 'scripts/find_packages_with_diff.js', 'scripts/run-build.sh'
Should we add 'scripts/generate_cloudbuild.js' to this list?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, Na!
Reviewable status:
complete! 1 of 1 approvals obtained (waiting on @lina128, @pyu10055, and @tafsiri)
scripts/generate_cloudbuild.js, line 103 at r1 (raw file):
Previously, lina128 (Na Li) wrote…
Seems this TODO is implemented below, you can remove the TODO.
Done.
scripts/find_packages_with_diff.js, line 25 at r1 (raw file):
Previously, lina128 (Na Li) wrote…
Should we add 'scripts/generate_cloudbuild.js' to this list?
Done.
Example builds that have failed: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status:
complete! 1 of 1 approvals obtained (waiting on @lina128, @mattsoulanille, @pyu10055, and @tafsiri)
scripts/cloudbuild_tfjs_core_expected.yml, line 95 at r2 (raw file):
waitFor: - yarn-common - yarn-tfjs-core
many of the waitFor seems to be duplicated? for example tfjs-backend-cpu waits for yarn-tfjs-backend-cpu, which has already waited for yarn-tfjs-core? Should these duplication be cleaned up?
tfjs-backend-cpu is listed in tfjs-data's package.json as a linked dependency, but was missing form package_dependencies.json.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status:
complete! 1 of 1 approvals obtained (waiting on @lina128, @pyu10055, and @tafsiri)
scripts/cloudbuild_tfjs_core_expected.yml, line 95 at r2 (raw file):
Previously, pyu10055 (Ping Yu) wrote…
many of the waitFor seems to be duplicated? for example tfjs-backend-cpu waits for yarn-tfjs-backend-cpu, which has already waited for yarn-tfjs-core? Should these duplication be cleaned up?
We could trim each step's dependencies down to the smallest set possible in generate_cloudbuild.js, but I think it's best to maintain a 1:1 relationship with the contents of package_dependencies.json, which is derived from the package.json files. It might be confusing if dependencies seemingly don't show up in a given step in the generated file, even if they're present transitively.
As a side note, we might eventually want to use the package.json files directly instead of keeping a separate 'package_dependencies.json' file up-to-date, but that can come in a later commit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi Matt, it seems it stops when test-webgpu-tfjs-backend-webgpu starts. It build deps again in the test-ci.sh. Since we migrate to dynamic cloudbuild approach, all the build steps should be controlled by cloudbuild file, not by shell scripts. Can you refactor test-webgpu cloudbuild steps? Basically, moving yarn lint and yarn build-deps from test-ci.sh to cloudbuild.yml.
Reviewable status:
complete! 1 of 1 approvals obtained (waiting on @lina128, @pyu10055, and @tafsiri)
Thanks so much, Na! I was looking for packages that were rebuilding dependencies, but I must have missed that one. I'll refactor it. |
Move tfjs-backend-webgpu's dependency building to the 'build-deps' step so it can be removed during CI. This commit also updates the test cloudbuild files that should have been updated in f9c1fdd. It would probably be best to amend that commit, but last time I tried force-pushing in GitHub, I broke the PR I was working on.
Just pushed the refactor. Individual cloudbuild tests for tfjs-backend-webgpu still work: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status:
complete! 2 of 1 approvals obtained (waiting on @lina128 and @tafsiri)
I think the current build failure might actually be a bug in tfjs-layers. I get the same error when I run the tfjs-layers cloudbuild by itself. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does tfjs-layers nightly build pass?
Reviewable status:
complete! 2 of 1 approvals obtained (waiting on @lina128 and @tafsiri)
FWIW i'm seeing the same error in layers on a different PR. |
The change triggering the failure in the layers tests has been reverted so you should be able to sync this with HEAD and see it pass. |
tfjs-backend-webgpu is now pinned at v2.7.0 for all its tfjs dependencies.
update-cloudbuild-tests updates the change detection tests for scripts/generate_cloudbuild.js. It should be run iff a change affects cloudbuild files or the dependency tree.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 3 of 7 files at r6.
Reviewable status:complete! 2 of 1 approvals obtained (waiting on @lina128, @mattsoulanille, and @tafsiri)
package.json, line 31 at r6 (raw file):
"test-release-notes": "ts-node -s ./scripts/release_notes/run_tests.ts", "update-tfjs-lockfiles": "ts-node -s ./scripts/update-tfjs-lockfiles", "update-cloudbuild-tests": "yarn generate-cloudbuild-for-packages tfjs-node && cp cloudbuild_generated.yml scripts/cloudbuild_tfjs_node_expected.yml && yarn generate-cloudbuild-for-packages tfjs-core && cp cloudbuild_generated.yml scripts/cloudbuild_tfjs_core_expected.yml"
Are these files all generated in the top level folder? I.e. why not have yarn generate-cloudbuild-for-packages tfjs-node
write the file into tfjs-node (and then you don't need the cp command)? I'm mainly thinking of the potential for overlapping state if someone is running this manually. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status:
complete! 2 of 1 approvals obtained (waiting on @lina128 and @tafsiri)
package.json, line 31 at r6 (raw file):
Previously, tafsiri (Yannick Assogba) wrote…
Are these files all generated in the top level folder? I.e. why not have
yarn generate-cloudbuild-for-packages tfjs-node
write the file into tfjs-node (and then you don't need the cp command)? I'm mainly thinking of the potential for overlapping state if someone is running this manually. What do you think?
generate_cloudbuild_for_packages must be run in the project's root directory, but I've added a flag to it to change the output location and name. update-cloudbuild-tests no longer overwrites cloudbuild_generated.yml.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Matt! LGTM (reviewed the last two commits).
Reviewed 1 of 7 files at r6, 3 of 3 files at r7.
Reviewable status:complete! 3 of 1 approvals obtained (waiting on @lina128 and @tafsiri)
Thanks for the review, Yannick! I'll submit once the presubmits pass. |
The previous PR was reverted because it broke CI. This PR includes the previous PR's changes as a single commit and includes fixes in the commits following it.
Instead of running each CI test on a separate Google Cloud machine, generate a cloudbuild.yml file to test all affected packages on a single machine. The benifits of this strategy include the following:
When CI is run, the root cloudbuild.yml is launched on Google Cloud. This file has only three steps: Install yarn dependencies, run
yarn generate-cloudbuild-for-packages
, and runrun-build.sh
. The second step finds what packages were affected by the change and combines their cloudbuild.yml files into a singlecloudbuild_generated.yml
file, and the third step launches the generated build on Google Cloud. The steps in the generated build are set towaitFor
each other according to the dependency treescripts/package_dependencies.json
, and each package'sbuild-deps
step is removed in favor of waiting for its dependencies to be built.Each package's cloudbuild file still works as expected, and there are two new yarn commands at the root:
yarn test-packages
tests all packages affected by a change. It performs the same steps as CI.yarn generate-cloudbuild-for-packages [PACKAGE]...
generates the cloudbuild file to test eachPACKAGE
and packages it might affect. If given no packages, it detects what packages have been changed in the current PR.To see the logs from the Cloud Build CI, please join either our discussion or announcement mailing list.
This change is