Conversation

Contributor

@lucian-tosa lucian-tosa commented Nov 10, 2025

Summary

This PR aims to reduce the flakiness of the following tests:

e2e_multi_cluster_sharded_snippets

Increased the timeout of test_running. In failing runs, the resources do become ready, but only by the time the diagnostics are collected, which suggests the original timeout was simply too short.

e2e_multi_cluster_appdb_upgrade_downgrade_v1_27_to_mck

Increased the timeout of test_scale_appdb. Similarly, the assertion on the AppDB status fails, but by the time diagnostics are collected, the resource has become ready.
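For both tests the change is just a more generous timeout on the phase wait. A minimal sketch of what that looks like, assuming the kubetester-style `assert_reaches_phase` helper these e2e tests use (the import path, member count, and timeout values here are illustrative, not taken from the diff):

```python
from kubetester.phase import Phase  # assumed import path; may differ in the actual test framework


def test_scale_appdb(ops_manager):
    # Scale the AppDB and wait for the resource to settle back into Running.
    ops_manager.load()
    ops_manager["spec"]["applicationDatabase"]["members"] = 5
    ops_manager.update()

    # Illustrative fix: bump the timeout so that slow-but-healthy runs
    # (where diagnostics later show the resource Running) no longer fail.
    ops_manager.appdb_status().assert_reaches_phase(Phase.Running, timeout=1200)
```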

e2e_appdb_tls_operator_upgrade_v1_32_to_mck

In this test we have a race condition, visible in the resource status over the course of the upgrade:

om-appdb-upgrade-tls   1          7.0.18    Running              Pending         Disabled         17m
om-appdb-upgrade-tls   1          7.0.18    Running              Running         Disabled         17m
om-appdb-upgrade-tls   1          7.0.18    Pending              Running         Disabled         17m
om-appdb-upgrade-tls   1          7.0.18    Pending              Running         Disabled         18m
om-appdb-upgrade-tls   1          7.0.18    Pending              Running         Disabled         18m
om-appdb-upgrade-tls   1          7.0.18    Running              Running         Disabled         19m

There is a moment during the operator upgrade where the resource reports both the AppDB and OM statuses as Running. This happens very briefly, before the operator starts reconciling OM and sets the OM status back to Pending. In that window, the test passes both assertions almost instantly and moves on to assert healthiness by connecting to OM, which fails because OM was not actually ready.

Reaching phase Running for resource AppDbStatus took 216.2561867237091s
Reaching phase Running for resource OmStatus took 0.0025169849395751953s

To fix this, I added a persist_for flag to our assertion methods. It makes sure that the phase we are asserting is not only reached but also persists for a given number of retries.
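A minimal sketch of how such a persist_for check can work (hypothetical helper; the real change lives in the e2e assertion methods, so names and defaults here are assumptions):

```python
import time


def wait_for_phase(get_phase, target, timeout=600, interval=5, persist_for=0):
    """Wait until get_phase() returns `target` and keeps returning it for
    `persist_for` consecutive extra polls, so a transiently reported phase
    (e.g. OM briefly showing Running before reconciliation flips it back to
    Pending) cannot satisfy the assertion."""
    deadline = time.time() + timeout
    consecutive = 0
    while time.time() < deadline:
        if get_phase() == target:
            consecutive += 1
            # The target phase must be observed persist_for + 1 polls in a row.
            if consecutive > persist_for:
                return
        else:
            consecutive = 0
        time.sleep(interval)
    raise AssertionError(f"Phase {target} not reached (and persisted) within {timeout}s")


# Hypothetical usage in the upgrade test: the Running phase must persist for
# 10 consecutive polls before the test proceeds to connect to OM.
# wait_for_phase(lambda: ops_manager.om_status().get_phase(), "Running", persist_for=10)
```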

Proof of Work

Retried the above tests a few times, and all of them pass:
https://spruce.mongodb.com/version/6911c25146ed0e00077796e3/tasks?sorts=STATUS%3AASC%3BBASE_STATUS%3ADESC

Checklist

  • Have you linked a jira ticket and/or is the ticket in the title?
  • Have you checked whether your jira ticket required DOCSP changes?
  • Have you added a changelog file?

@lucian-tosa lucian-tosa changed the title from Lucian/flaky test fixes to Flaky test fixes Nov 10, 2025
@github-actions

⚠️ (this preview might not be accurate if the PR is not rebased on the current master branch)

MCK 1.6.0 Release Notes

New Features

  • MongoDBCommunity: Added support for configuring a custom cluster domain via the newly introduced spec.clusterDomain resource field. If spec.clusterDomain is not set, the CLUSTER_DOMAIN environment variable is used as the cluster domain. If the environment variable CLUSTER_DOMAIN is also not set, the operator falls back to cluster.local as the default cluster domain.
  • Helm Chart: Introduced two new Helm fields, operator.podSecurityContext and operator.securityContext, that can be used to configure the securityContext of the Operator deployment through the Helm Chart.
  • MongoDBSearch: Switch to gRPC and mTLS for internal communication
    Since MCK 1.4, the mongod and mongot processes have communicated using the MongoDB Wire Protocol with keyfile authentication. This release switches that to gRPC with mTLS authentication. gRPC will allow for load-balancing search queries across multiple mongot processes in the future, and mTLS decouples the internal cluster authentication mode and credentials among mongod processes from the connection to the mongot process. The Operator will automatically enable gRPC for existing and new workloads, and will enable mTLS authentication if both the Database Server and the MongoDBSearch resource are configured for TLS.
  • MongoDBSearch: MongoDB deployments using X509 internal cluster authentication are now supported. Previously, MongoDB Search required SCRAM authentication among members of a MongoDB replica set. Note: SCRAM client authentication is still required; this change merely relaxes the requirements on internal cluster authentication.
  • MongoDBSearch: Updated the default mongodb/mongodb-search image version to 0.55.0. This is the version MCK uses if .spec.version is not specified.

Bug Fixes

  • Fixed parsing of the customEnvVars Helm value when values contain = characters.
  • ReplicaSet: Blocked disabling TLS and changing member count simultaneously. These operations must now be applied separately to prevent configuration inconsistencies.
  • Fixed inability to specify cluster-wide privileges in custom roles.

Other Changes

  • Simplified MongoDB Search setup: Removed the custom Search Coordinator polyfill (a piece of compatibility code previously needed to add the required permissions), as MongoDB 8.2.0 and later now include the necessary permissions via the built-in searchCoordinator role.
  • kubectl-mongodb plugin: cosign, the signing tool used to sign kubectl-mongodb plugin binaries, has been updated to version 3.0.2. With this change, released binaries will be bundled with .bundle files containing both signature and certificate information. For more information on how to verify signatures with the new cosign version, please refer to https://github.com/sigstore/cosign/blob/v3.0.2/doc/cosign_verify-blob.md

@lucian-tosa lucian-tosa added the skip-changelog label (Use this label in a Pull Request to not require a new changelog entry file) Nov 10, 2025
@lucian-tosa lucian-tosa marked this pull request as ready for review November 10, 2025 13:55
@lucian-tosa lucian-tosa requested a review from a team as a code owner November 10, 2025 13:55
Contributor

@m1kola m1kola left a comment


> There is a moment during the operator upgrade where the resource reports both the AppDB and OM statuses as Running. This happens very briefly, before the operator starts reconciling OM and sets the OM status back to Pending.

Great observation! I think we should address this root cause instead of hacking around and making tests pass.

It will likely take more time, so I suggest separating this work from the other flake fixes (where we rightfully increase timeouts).
