Conversation

Contributor

@lucian-tosa lucian-tosa commented Nov 10, 2025

Summary

This PR aims to reduce the flakiness of the following tests:

e2e_multi_cluster_sharded_snippets

Increased the timeout of test_running. In failing runs, the resources do become ready, but only by the time the diagnostics are collected, which suggests the original timeout was simply too short.

e2e_multi_cluster_appdb_upgrade_downgrade_v1_27_to_mck

Increased the timeout of test_scale_appdb. Similarly, the assertion on the AppDB status fails, but by the time diagnostics are collected, the resource has become ready.
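For both tests the change is just a more generous timeout on the phase wait. A minimal sketch of what that looks like, assuming the kubetester-style `assert_reaches_phase` helper these e2e tests use (the import path, member count, and timeout values here are illustrative, not taken from the diff):

```python
from kubetester.phase import Phase  # assumed import path; may differ in the actual test framework


def test_scale_appdb(ops_manager):
    # Scale the AppDB and wait for the resource to settle back into Running.
    ops_manager.load()
    ops_manager["spec"]["applicationDatabase"]["members"] = 5
    ops_manager.update()

    # Illustrative fix: bump the timeout so that slow-but-healthy runs
    # (where diagnostics later show the resource Running) no longer fail.
    ops_manager.appdb_status().assert_reaches_phase(Phase.Running, timeout=1200)
```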

e2e_appdb_tls_operator_upgrade_v1_32_to_mck

In this test we have a race condition, visible in the resource status over the course of the upgrade:

om-appdb-upgrade-tls   1          7.0.18    Running              Pending         Disabled         17m
om-appdb-upgrade-tls   1          7.0.18    Running              Running         Disabled         17m
om-appdb-upgrade-tls   1          7.0.18    Pending              Running         Disabled         17m
om-appdb-upgrade-tls   1          7.0.18    Pending              Running         Disabled         18m
om-appdb-upgrade-tls   1          7.0.18    Pending              Running         Disabled         18m
om-appdb-upgrade-tls   1          7.0.18    Running              Running         Disabled         19m

There is a moment during the operator upgrade where the resource reports both the AppDB and OM statuses as Running. This happens very briefly, before the operator starts reconciling OM and sets the OM status back to Pending. In that window, the test passes both assertions almost instantly and moves on to assert healthiness by connecting to OM, which fails because OM was not actually ready.

Reaching phase Running for resource AppDbStatus took 216.2561867237091s
Reaching phase Running for resource OmStatus took 0.0025169849395751953s

To fix this, I added a persist_for flag to our assertion methods. It makes sure that the phase we are asserting is not only reached but also persists for a given number of retries.
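A minimal sketch of how such a persist_for check can work (hypothetical helper; the real change lives in the e2e assertion methods, so names and defaults here are assumptions):

```python
import time


def wait_for_phase(get_phase, target, timeout=600, interval=5, persist_for=0):
    """Wait until get_phase() returns `target` and keeps returning it for
    `persist_for` consecutive extra polls, so a transiently reported phase
    (e.g. OM briefly showing Running before reconciliation flips it back to
    Pending) cannot satisfy the assertion."""
    deadline = time.time() + timeout
    consecutive = 0
    while time.time() < deadline:
        if get_phase() == target:
            consecutive += 1
            # The target phase must be observed persist_for + 1 polls in a row.
            if consecutive > persist_for:
                return
        else:
            consecutive = 0
        time.sleep(interval)
    raise AssertionError(f"Phase {target} not reached (and persisted) within {timeout}s")


# Hypothetical usage in the upgrade test: the Running phase must persist for
# 10 consecutive polls before the test proceeds to connect to OM.
# wait_for_phase(lambda: ops_manager.om_status().get_phase(), "Running", persist_for=10)
```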

Proof of Work

Retried the above tests a few times, and all of them pass:
https://spruce.mongodb.com/version/6911c25146ed0e00077796e3/tasks?sorts=STATUS%3AASC%3BBASE_STATUS%3ADESC

Checklist

  • Have you linked a jira ticket and/or is the ticket in the title?
  • Have you checked whether your jira ticket required DOCSP changes?
  • Have you added a changelog file?

@lucian-tosa lucian-tosa changed the title from Lucian/flaky test fixes to Flaky test fixes Nov 10, 2025
@github-actions

⚠️ (this preview might not be accurate if the PR is not rebased on the current master branch)

MCK 1.6.0 Release Notes

New Features

  • MongoDBCommunity: Added support for configuring a custom cluster domain via the newly introduced spec.clusterDomain resource field. If spec.clusterDomain is not set, the CLUSTER_DOMAIN environment variable is used as the cluster domain. If the environment variable CLUSTER_DOMAIN is also not set, the operator falls back to cluster.local as the default cluster domain.
  • Helm Chart: Introduced two new Helm fields, operator.podSecurityContext and operator.securityContext, that can be used to configure the securityContext of the Operator deployment through the Helm Chart.
  • MongoDBSearch: Switch to gRPC and mTLS for internal communication
    Since MCK 1.4, the mongod and mongot processes have communicated using the MongoDB Wire Protocol with keyfile authentication. This release switches that to gRPC with mTLS authentication. gRPC will allow for load-balancing search queries across multiple mongot processes in the future, and mTLS decouples the internal cluster authentication mode and credentials among mongod processes from the connection to the mongot process. The Operator will automatically enable gRPC for existing and new workloads, and will enable mTLS authentication if both the Database Server and the MongoDBSearch resource are configured for TLS.
  • MongoDBSearch: MongoDB deployments using X509 internal cluster authentication are now supported. Previously, MongoDB Search required SCRAM authentication among members of a MongoDB replica set. Note: SCRAM client authentication is still required; this change merely relaxes the requirements on internal cluster authentication.
  • MongoDBSearch: Updated the default mongodb/mongodb-search image version to 0.55.0. This is the version MCK uses if .spec.version is not specified.

Bug Fixes

  • Fixed parsing of the customEnvVars Helm value when values contain = characters.
  • ReplicaSet: Blocked disabling TLS and changing member count simultaneously. These operations must now be applied separately to prevent configuration inconsistencies.
  • Fixed inability to specify cluster-wide privileges in custom roles.

Other Changes

  • Simplified MongoDB Search setup: Removed the custom Search Coordinator polyfill (a piece of compatibility code previously needed to add the required permissions), as MongoDB 8.2.0 and later now include the necessary permissions via the built-in searchCoordinator role.
  • kubectl-mongodb plugin: cosign, the signing tool used to sign kubectl-mongodb plugin binaries, has been updated to version 3.0.2. With this change, released binaries will be bundled with .bundle files containing both signature and certificate information. For more information on how to verify signatures with the new cosign version, please refer to https://github.com/sigstore/cosign/blob/v3.0.2/doc/cosign_verify-blob.md

@lucian-tosa lucian-tosa added the skip-changelog label (Use this label in a Pull Request to not require a new changelog entry file) Nov 10, 2025
@lucian-tosa lucian-tosa marked this pull request as ready for review November 10, 2025 13:55
@lucian-tosa lucian-tosa requested a review from a team as a code owner November 10, 2025 13:55
Contributor

@m1kola m1kola left a comment


> There is a moment during the operator upgrade where the resource reports both the AppDB and OM statuses as Running. This happens very briefly, before the operator starts reconciling OM and sets the OM status back to Pending.

Great observation! I think we should address this root cause instead of hacking around and making tests pass.

It will likely take more time, so I suggest separating this work from the other flake fixes (where we rightfully increase timeouts).
