Flaky test fixes #580
base: master
MCK 1.6.0 Release Notes
New Features
Bug Fixes
Other Changes
m1kola left a comment:
There is a moment during the operator upgrade where the resource has the status of AppDB and OM set to running. This happens very briefly before the operator starts reconciling OM and sets the OM status to Pending.
Great observation! I think we should address this root cause instead of hacking around it to make the tests pass.
It will likely take more time, so I suggest separating this work from the other flake fixes (where we rightfully increase timeouts).
Summary
This PR aims to reduce the flakiness of the following tests:
e2e_multi_cluster_sharded_snippets
Increased the timeout of test_running, since in failing runs, by the time the diagnostics are collected, the resources become ready.
e2e_multi_cluster_appdb_upgrade_downgrade_v1_27_to_mck
Increased the timeout of test_scale_appdb. Similarly, the assertion on appdb status fails, but by the time diagnostics are collected, the resource becomes ready.
e2e_appdb_tls_operator_upgrade_v1_32_to_mck
In this test we have a race condition.
There is a moment during the operator upgrade where the resource has the status of AppDB and OM set to running. This happens very briefly before the operator starts reconciling OM and sets the OM status to Pending. In that moment, the test will very quickly pass both assertions and move on to assert healthiness by connecting to OM. This will fail since OM was not actually ready.
To fix this, I added a persist_for flag in our assertion methods. This makes sure that the phase we are currently asserting is reached and persists for a number of retries.
Proof of Work
Retried the above tests a few times, and they all passed:
https://spruce.mongodb.com/version/6911c25146ed0e00077796e3/tasks?sorts=STATUS%3AASC%3BBASE_STATUS%3ADESC
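For reviewers who want a concrete picture of the persist_for idea from the Summary, here is a minimal sketch of a polling assertion that only succeeds once the expected phase has been observed on several consecutive polls. It is not the code from this PR: the function name wait_for_phase, the get_phase callable, and the timeout, interval, and sleep parameters are all hypothetical and chosen for illustration only.

```python
import time
from typing import Callable


def wait_for_phase(
    get_phase: Callable[[], str],  # hypothetical: returns the resource's current phase
    expected: str,
    timeout: float = 60.0,
    interval: float = 1.0,
    persist_for: int = 1,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True once `expected` is seen on `persist_for` consecutive polls.

    Returns False if that does not happen before `timeout` seconds elapse.
    """
    if persist_for < 1:
        raise ValueError("persist_for must be >= 1")

    deadline = time.monotonic() + timeout
    consecutive = 0
    while True:
        if get_phase() == expected:
            consecutive += 1
            if consecutive >= persist_for:
                return True
        else:
            # The phase flipped away (e.g. Running -> Pending), so start over.
            consecutive = 0
        if time.monotonic() >= deadline:
            return False
        sleep(interval)
```

With persist_for=1 this behaves like a plain "wait until phase" check, so a brief stale Running status would satisfy it immediately. With a larger value, a status that flips to Pending on the next poll resets the counter, and the assertion only passes once the phase has stayed stable.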
Checklist
skip-changelog label if not needed