Skip to content

Conversation

mattsoulanille
Copy link
Member

@mattsoulanille mattsoulanille commented Nov 4, 2020

The previous PR was reverted because it broke CI. This PR includes the previous PR's changes as a single commit and includes fixes in the commits following it.

Instead of running each CI test on a separate Google Cloud machine, generate a cloudbuild.yml file to test all affected packages on a single machine. The benifits of this strategy include the following:

  • Fewer machines are required for a single test (2 instead of up to 12), so queuing should happen much less often.
  • Packages only need to be built once, instead of once per machine.
  • Multithreaded steps have access to more cores.

When CI is run, the root cloudbuild.yml is launched on Google Cloud. This file has only three steps: Install yarn dependencies, run yarn generate-cloudbuild-for-packages, and run run-build.sh. The second step finds what packages were affected by the change and combines their cloudbuild.yml files into a single cloudbuild_generated.yml file, and the third step launches the generated build on Google Cloud. The steps in the generated build are set to waitFor each other according to the dependency tree scripts/package_dependencies.json, and each package's build-deps step is removed in favor of waiting for its dependencies to be built.

Each package's cloudbuild file still works as expected, and there are two new yarn commands at the root:

  • yarn test-packages tests all packages affected by a change. It performs the same steps as CI.
  • yarn generate-cloudbuild-for-packages [PACKAGE]... generates the cloudbuild file to test each PACKAGE and packages it might affect. If given no packages, it detects what packages have been changed in the current PR.

To see the logs from the Cloud Build CI, please join either our discussion or announcement mailing list.


This change is Reviewable

Instead of running each CI test on a separate Google Cloud machine, generate a cloudbuild.yml file to test all affected packages on a single machine. The benifits of this strategy include the following:
* Fewer machines are required for a single test (2 instead of up to 12), so queuing should happen much less often.
* Packages only need to be built once, instead of once per machine.
* Multithreaded steps have access to more cores.

When CI is run, the root cloudbuild.yml is launched on Google Cloud. This file has only three steps: Install yarn dependencies, run `yarn generate-cloudbuild-for-packages`, and run `run-build.sh`. The second step finds what packages were affected by the change and combines their cloudbuild.yml files into a single `cloudbuild_generated.yml` file, and the third step launches the generated build on Google Cloud. The steps in the generated build are set to `waitFor` each other according to the dependency tree `scripts/package_dependencies.json`, and each package's `build-deps` step is removed in favor of waiting for its dependencies to be built.

Each package's cloudbuild file still works as expected, and there are two new yarn commands at the root:
* `yarn test-packages` tests all packages affected by a change. It performs the same steps as CI.
* `yarn generate-cloudbuild-for-packages [PACKAGE]...` generates the cloudbuild file to test each `PACKAGE` and packages it might affect. If given no packages, it detects what packages have been changed in the current PR.
tfjs-node builds N-API bindings that end-to-end tests need to use. End-to-end uses our custom docker image (gcr.io/learnjs-174218/release) since it has to build tfjs-backend-wasm when it's run by itself (which requires emscripten). Our custom image uses node 12, and is incompatible with node 10 N-API bindings. Eventually, we'll want to move everything that's using node:10 to node:12, but that's out of the scope of this commit.
Copy link
Collaborator

@lina128 lina128 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work!!! Very clean and thorough. Thank you Matt!

Reviewable status: :shipit: complete! 1 of 1 approvals obtained (waiting on @lina128, @mattsoulanille, @pyu10055, and @tafsiri)


scripts/generate_cloudbuild.js, line 103 at r1 (raw file):

    // If there are no such nodes, then the graph has a cycle.
    if (emptyNodes.length === 0) {
      // TODO(msoulanille): Include the cycle in the error message?

Seems this TODO is implemented below, you can remove the TODO.


scripts/find_packages_with_diff.js, line 25 at r1 (raw file):

const filesAllowlistToTriggerBuild = [
  'cloudbuild.yml', 'package.json', 'tsconfig.json', 'tslint.json',
  'scripts/find_packages_with_diff.js', 'scripts/run-build.sh'

Should we add 'scripts/generate_cloudbuild.js' to this list?

Copy link
Member Author

@mattsoulanille mattsoulanille left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, Na!

Reviewable status: :shipit: complete! 1 of 1 approvals obtained (waiting on @lina128, @pyu10055, and @tafsiri)


scripts/generate_cloudbuild.js, line 103 at r1 (raw file):

Previously, lina128 (Na Li) wrote…

Seems this TODO is implemented below, you can remove the TODO.

Done.


scripts/find_packages_with_diff.js, line 25 at r1 (raw file):

Previously, lina128 (Na Li) wrote…

Should we add 'scripts/generate_cloudbuild.js' to this list?

Done.

@mattsoulanille
Copy link
Member Author

test-browser in tfjs-layers sometimes fails to parse tfjs-core/dist/tf-core.node.js. On one run, it has said that a comment is not terminated, and in another, that a string is not terminated. Perhaps tfjs-core/dist/tf-core.node.js is being written while the test is being run, but I've looked through all the in-progress steps from the following builds and found nothing that would be writing to it.

Example builds that have failed:

https://console.cloud.google.com/cloud-build/builds/6b149616-e500-48b8-8f96-f5c54a4ba8c9?project=834911136599

https://console.cloud.google.com/cloud-build/builds/efd36d73-321a-4598-936b-d7dba2e0a713?project=834911136599

Copy link
Collaborator

@pyu10055 pyu10055 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 1 of 1 approvals obtained (waiting on @lina128, @mattsoulanille, @pyu10055, and @tafsiri)


scripts/cloudbuild_tfjs_core_expected.yml, line 95 at r2 (raw file):

    waitFor:
      - yarn-common
      - yarn-tfjs-core

many of the waitFor seems to be duplicated? for example tfjs-backend-cpu waits for yarn-tfjs-backend-cpu, which has already waited for yarn-tfjs-core? Should these duplication be cleaned up?

tfjs-backend-cpu is listed in tfjs-data's package.json as a linked dependency, but was missing form package_dependencies.json.
Copy link
Member Author

@mattsoulanille mattsoulanille left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 1 of 1 approvals obtained (waiting on @lina128, @pyu10055, and @tafsiri)


scripts/cloudbuild_tfjs_core_expected.yml, line 95 at r2 (raw file):

Previously, pyu10055 (Ping Yu) wrote…

many of the waitFor seems to be duplicated? for example tfjs-backend-cpu waits for yarn-tfjs-backend-cpu, which has already waited for yarn-tfjs-core? Should these duplication be cleaned up?

We could trim each step's dependencies down to the smallest set possible in generate_cloudbuild.js, but I think it's best to maintain a 1:1 relationship with the contents of package_dependencies.json, which is derived from the package.json files. It might be confusing if dependencies seemingly don't show up in a given step in the generated file, even if they're present transitively.

As a side note, we might eventually want to use the package.json files directly instead of keeping a separate 'package_dependencies.json' file up-to-date, but that can come in a later commit.

Copy link
Collaborator

@lina128 lina128 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Matt, it seems it stops when test-webgpu-tfjs-backend-webgpu starts. It build deps again in the test-ci.sh. Since we migrate to dynamic cloudbuild approach, all the build steps should be controlled by cloudbuild file, not by shell scripts. Can you refactor test-webgpu cloudbuild steps? Basically, moving yarn lint and yarn build-deps from test-ci.sh to cloudbuild.yml.

Reviewable status: :shipit: complete! 1 of 1 approvals obtained (waiting on @lina128, @pyu10055, and @tafsiri)

@mattsoulanille
Copy link
Member Author

Thanks so much, Na! I was looking for packages that were rebuilding dependencies, but I must have missed that one. I'll refactor it.

Move tfjs-backend-webgpu's dependency building to the 'build-deps' step so it can be removed during CI.

This commit also updates the test cloudbuild files that should have been updated in f9c1fdd. It would probably be best to amend that commit, but last time I tried force-pushing in GitHub, I broke the PR I was working on.
@mattsoulanille
Copy link
Member Author

Just pushed the refactor. Individual cloudbuild tests for tfjs-backend-webgpu still work:

https://console.cloud.google.com/cloud-build/builds/6cefa934-740d-4058-86f5-ade5d2d5d470?project=834911136599

Copy link
Collaborator

@pyu10055 pyu10055 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 2 of 1 approvals obtained (waiting on @lina128 and @tafsiri)

@mattsoulanille
Copy link
Member Author

I think the current build failure might actually be a bug in tfjs-layers. I get the same error when I run the tfjs-layers cloudbuild by itself.

Copy link
Collaborator

@lina128 lina128 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does tfjs-layers nightly build pass?

Reviewable status: :shipit: complete! 2 of 1 approvals obtained (waiting on @lina128 and @tafsiri)

@tafsiri
Copy link
Contributor

tafsiri commented Nov 6, 2020

FWIW i'm seeing the same error in layers on a different PR.

@tafsiri
Copy link
Contributor

tafsiri commented Nov 10, 2020

The change triggering the failure in the layers tests has been reverted so you should be able to sync this with HEAD and see it pass.

tfjs-backend-webgpu is now pinned at v2.7.0 for all its tfjs dependencies.
update-cloudbuild-tests updates the change detection tests for scripts/generate_cloudbuild.js. It should be run iff a change affects cloudbuild files or the dependency tree.
Copy link
Contributor

@tafsiri tafsiri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 3 of 7 files at r6.
Reviewable status: :shipit: complete! 2 of 1 approvals obtained (waiting on @lina128, @mattsoulanille, and @tafsiri)


package.json, line 31 at r6 (raw file):

    "test-release-notes": "ts-node -s ./scripts/release_notes/run_tests.ts",
    "update-tfjs-lockfiles": "ts-node -s ./scripts/update-tfjs-lockfiles",
    "update-cloudbuild-tests": "yarn generate-cloudbuild-for-packages tfjs-node && cp cloudbuild_generated.yml scripts/cloudbuild_tfjs_node_expected.yml && yarn generate-cloudbuild-for-packages tfjs-core && cp cloudbuild_generated.yml scripts/cloudbuild_tfjs_core_expected.yml"

Are these files all generated in the top level folder? I.e. why not have yarn generate-cloudbuild-for-packages tfjs-node write the file into tfjs-node (and then you don't need the cp command)? I'm mainly thinking of the potential for overlapping state if someone is running this manually. What do you think?

Copy link
Member Author

@mattsoulanille mattsoulanille left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 2 of 1 approvals obtained (waiting on @lina128 and @tafsiri)


package.json, line 31 at r6 (raw file):

Previously, tafsiri (Yannick Assogba) wrote…

Are these files all generated in the top level folder? I.e. why not have yarn generate-cloudbuild-for-packages tfjs-node write the file into tfjs-node (and then you don't need the cp command)? I'm mainly thinking of the potential for overlapping state if someone is running this manually. What do you think?

generate_cloudbuild_for_packages must be run in the project's root directory, but I've added a flag to it to change the output location and name. update-cloudbuild-tests no longer overwrites cloudbuild_generated.yml.

Copy link
Contributor

@tafsiri tafsiri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Matt! LGTM (reviewed the last two commits).

Reviewed 1 of 7 files at r6, 3 of 3 files at r7.
Reviewable status: :shipit: complete! 3 of 1 approvals obtained (waiting on @lina128 and @tafsiri)

@mattsoulanille
Copy link
Member Author

Thanks for the review, Yannick! I'll submit once the presubmits pass.

@mattsoulanille mattsoulanille merged commit 7bd157e into tensorflow:master Nov 10, 2020
@mattsoulanille mattsoulanille deleted the generate_cloud_build branch November 10, 2020 19:15
@mattsoulanille mattsoulanille restored the generate_cloud_build branch November 10, 2020 22:11
mattsoulanille added a commit that referenced this pull request Nov 10, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants