Skip to content

Conversation

WillemKauf
Copy link
Contributor

@WillemKauf WillemKauf commented Aug 19, 2025

KgoVerifierService objects should be waited upon and then stopped, not the other way around.

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v25.2.x
  • v25.1.x
  • v24.3.x

Release Notes

  • none

@vbotbuildovich
Copy link
Collaborator

vbotbuildovich commented Aug 19, 2025

Retry command for Build#71026

please wait until all jobs are finished before running the slash command

/ci-repeat 1
tests/rptest/tests/cloud_storage_chunk_read_path_test.py::CloudStorageChunkReadTest.test_prefetch_chunks@{"prefetch":0}
tests/rptest/tests/cloud_storage_chunk_read_path_test.py::CloudStorageChunkReadTest.test_fallback_mode
tests/rptest/tests/cloud_storage_chunk_read_path_test.py::CloudStorageChunkReadTest.test_read_when_cache_smaller_than_segment_size
tests/rptest/tests/cloud_storage_chunk_read_path_test.py::CloudStorageChunkReadTest.test_read_chunks
tests/rptest/tests/cloud_storage_chunk_read_path_test.py::CloudStorageChunkReadTest.test_prefetch_chunks@{"prefetch":5}
tests/rptest/tests/cloud_storage_chunk_read_path_test.py::CloudStorageChunkReadTest.test_read_when_segment_size_smaller_than_chunk_size
tests/rptest/tests/cloud_storage_chunk_read_path_test.py::CloudStorageChunkReadTest.test_prefetch_chunks@{"prefetch":3}
tests/rptest/tests/cloud_storage_chunk_read_path_test.py::CloudStorageChunkReadTest.test_read_when_chunk_api_disabled
tests/rptest/tests/e2e_shadow_indexing_test.py::ShadowIndexingManyPartitionsTest.test_many_partitions_shutdown
tests/rptest/tests/nodes_decommissioning_test.py::NodesDecommissioningTest.test_learner_gap_metrics

@vbotbuildovich
Copy link
Collaborator

vbotbuildovich commented Aug 20, 2025

Retry command for Build#71031

please wait until all jobs are finished before running the slash command

/ci-repeat 1
tests/rptest/tests/cloud_storage_chunk_read_path_test.py::CloudStorageChunkReadTest.test_read_when_chunk_api_disabled
tests/rptest/tests/cloud_storage_chunk_read_path_test.py::CloudStorageChunkReadTest.test_fallback_mode
tests/rptest/tests/cloud_storage_chunk_read_path_test.py::CloudStorageChunkReadTest.test_read_when_segment_size_smaller_than_chunk_size
tests/rptest/tests/cloud_storage_chunk_read_path_test.py::CloudStorageChunkReadTest.test_read_chunks

`KgoVerifierService` objects should be waited upon and then stopped,
not the other way around.
@WillemKauf WillemKauf changed the title rptest: catch bad tests rptest: assert !_stopped in KgoVerifierServices::wait() implementations Aug 20, 2025
@vbotbuildovich
Copy link
Collaborator

vbotbuildovich commented Aug 20, 2025

Retry command for Build#71034

please wait until all jobs are finished before running the slash command

/ci-repeat 1
tests/rptest/tests/nodes_decommissioning_test.py::NodesDecommissioningTest.test_learner_gap_metrics
tests/rptest/tests/e2e_shadow_indexing_test.py::ShadowIndexingTrafficShapingTest.test_many_partitions_recovery_with_traffic_shaping

@vbotbuildovich
Copy link
Collaborator

vbotbuildovich commented Aug 20, 2025

CI test results

test results on build#71034
test_class test_method test_arguments test_kind job_url test_status passed reason
ShadowIndexingManyPartitionsTest test_many_partitions_shutdown null integration https://buildkite.com/redpanda/redpanda/builds/71034#0198c558-0e9b-4543-beb8-d242710c5528 FAIL 0/21
ShadowIndexingTrafficShapingTest test_many_partitions_recovery_with_traffic_shaping null integration https://buildkite.com/redpanda/redpanda/builds/71034#0198c558-0e9c-4909-8ef4-7d47ec51d546 FAIL 0/21 The test has failed across all retries
NodesDecommissioningTest test_learner_gap_metrics null integration https://buildkite.com/redpanda/redpanda/builds/71034#0198c558-0e99-4579-9cb1-615d5bddffdd FAIL 0/21 The test has failed across all retries
test results on build#71074
test_class test_method test_arguments test_kind job_url test_status passed reason
EndToEndCloudTopicsTxTest test_write null integration https://buildkite.com/redpanda/redpanda/builds/71074#0198c7e5-fa35-4a98-86a5-7126d3568fe0 FLAKY 18/21 upstream reliability is '93.12883435582822'. current run reliability is '85.71428571428571'. drift is 7.41455 and the allowed drift is set to 50. The test should PASS
ShadowIndexingManyPartitionsTest test_many_partitions_recovery null integration https://buildkite.com/redpanda/redpanda/builds/71074#0198c7df-8c69-4d32-b0c3-62ee1b0428b2 FAIL 0/5
test results on build#71084
test_class test_method test_arguments test_kind job_url test_status passed reason
RandomNodeOperationsTest test_node_operations {"cloud_storage_type": 2, "compaction_mode": "adjacent_merge", "enable_failures": false, "mixed_versions": true, "with_iceberg": false} integration https://buildkite.com/redpanda/redpanda/builds/71084#0198c883-c5a8-4af9-a7fd-ad391a0e380b FLAKY 19/21 upstream reliability is '100.0'. current run reliability is '90.47619047619048'. drift is 9.52381 and the allowed drift is set to 50. The test should PASS
test results on build#71154
test_class test_method test_arguments test_kind job_url test_status passed reason
FeaturesMultiNodeUpgradeTest test_rollback null integration https://buildkite.com/redpanda/redpanda/builds/71154#0198ccef-0823-4526-a451-83bd15e4923a FAIL 0/21 The test has failed across all retries
FeaturesMultiNodeUpgradeTest test_rollback null integration https://buildkite.com/redpanda/redpanda/builds/71154#0198ccf5-8544-431b-8660-3fc942108d0e FAIL 0/21 The test has failed across all retries
FeaturesMultiNodeUpgradeTest test_upgrade null integration https://buildkite.com/redpanda/redpanda/builds/71154#0198ccef-0824-4d47-9d34-97d500413903 FAIL 0/21 The test has failed across all retries
FeaturesMultiNodeUpgradeTest test_upgrade null integration https://buildkite.com/redpanda/redpanda/builds/71154#0198ccf5-8545-4fff-9b6d-b22f0e73b8ed FAIL 0/21 The test has failed across all retries
FeaturesNodeJoinTest test_old_node_join null integration https://buildkite.com/redpanda/redpanda/builds/71154#0198ccef-0825-4f4a-be4e-1e3566b14cef FAIL 0/21 The test has failed across all retries
FeaturesNodeJoinTest test_old_node_join null integration https://buildkite.com/redpanda/redpanda/builds/71154#0198ccf5-8545-4afd-8140-5fca75b99477 FAIL 0/21 The test has failed across all retries
FeaturesSingleNodeUpgradeTest test_upgrade null integration https://buildkite.com/redpanda/redpanda/builds/71154#0198ccef-081f-4c3a-983a-7bfd271e0692 FAIL 0/1
FeaturesSingleNodeUpgradeTest test_upgrade null integration https://buildkite.com/redpanda/redpanda/builds/71154#0198ccf5-8540-472a-9566-369415bba173 FAIL 0/21 The test has failed across all retries
LicenseEnforcementTest test_license_enforcement {"clean_node_after_recovery": false, "clean_node_before_recovery": false} integration https://buildkite.com/redpanda/redpanda/builds/71154#0198ccef-081d-4b15-86c1-695f334fea11 FAIL 0/21 The test has failed across all retries
LicenseEnforcementTest test_license_enforcement {"clean_node_after_recovery": false, "clean_node_before_recovery": false} integration https://buildkite.com/redpanda/redpanda/builds/71154#0198ccf5-8543-4ec1-872b-b00dae967717 FAIL 0/21 The test has failed across all retries
LicenseEnforcementTest test_license_enforcement {"clean_node_after_recovery": true, "clean_node_before_recovery": false} integration https://buildkite.com/redpanda/redpanda/builds/71154#0198ccef-081e-480b-893a-a2dc86c6d34b FAIL 0/21 The test has failed across all retries
LicenseEnforcementTest test_license_enforcement {"clean_node_after_recovery": true, "clean_node_before_recovery": false} integration https://buildkite.com/redpanda/redpanda/builds/71154#0198ccf5-8544-431b-8660-3fc942108d0e FAIL 0/21 The test has failed across all retries
LicenseEnforcementTest test_license_enforcement {"clean_node_after_recovery": false, "clean_node_before_recovery": true} integration https://buildkite.com/redpanda/redpanda/builds/71154#0198ccef-081f-4c3a-983a-7bfd271e0692 FAIL 0/1
LicenseEnforcementTest test_license_enforcement {"clean_node_after_recovery": false, "clean_node_before_recovery": true} integration https://buildkite.com/redpanda/redpanda/builds/71154#0198ccf5-8545-4fff-9b6d-b22f0e73b8ed FAIL 0/21 The test has failed across all retries
LicenseEnforcementTest test_license_enforcement {"clean_node_after_recovery": true, "clean_node_before_recovery": true} integration https://buildkite.com/redpanda/redpanda/builds/71154#0198ccef-0820-42b6-bd65-4134a31211a1 FAIL 0/21 The test has failed across all retries
LicenseEnforcementTest test_license_enforcement {"clean_node_after_recovery": true, "clean_node_before_recovery": true} integration https://buildkite.com/redpanda/redpanda/builds/71154#0198ccf5-8545-4afd-8140-5fca75b99477 FAIL 0/21 The test has failed across all retries
RandomNodeOperationsTest test_node_operations {"cloud_storage_type": 2, "compaction_mode": "adjacent_merge", "enable_failures": true, "mixed_versions": true, "with_iceberg": false} integration https://buildkite.com/redpanda/redpanda/builds/71154#0198ccef-081f-4c3a-983a-7bfd271e0692 FAIL 0/1
TopicIDUpgradeTest test_topic_id_migration null integration https://buildkite.com/redpanda/redpanda/builds/71154#0198ccef-0821-48cd-93fe-ea5ccf12fe1d FAIL 0/21 The test has failed across all retries
TopicIDUpgradeTest test_topic_id_migration null integration https://buildkite.com/redpanda/redpanda/builds/71154#0198ccf5-8547-44dc-9607-81ff1befbf94 FAIL 0/21 The test has failed across all retries
test results on build#71191
test_class test_method test_arguments test_kind job_url test_status passed reason
DatalakeE2ETests test_json_schema_unicode {"catalog_type": "rest_hadoop", "cloud_storage_type": 1, "query_engine": "trino"} integration https://buildkite.com/redpanda/redpanda/builds/71191#0198ce68-c7aa-45a0-ac87-abbd5941f7ec FLAKY 20/21 upstream reliability is '98.37209302325581'. current run reliability is '95.23809523809523'. drift is 3.134 and the allowed drift is set to 50. The test should PASS
RandomNodeOperationsTest test_node_operations {"cloud_storage_type": 2, "compaction_mode": "sliding_window", "enable_failures": true, "mixed_versions": true, "with_iceberg": false} integration https://buildkite.com/redpanda/redpanda/builds/71191#0198ce68-c7aa-45a0-ac87-abbd5941f7ec FLAKY 20/21 upstream reliability is '100.0'. current run reliability is '95.23809523809523'. drift is 4.7619 and the allowed drift is set to 50. The test should PASS
test results on build#71239
test_class test_method test_arguments test_kind job_url test_status passed reason
NodesDecommissioningTest test_learner_gap_metrics null integration https://buildkite.com/redpanda/redpanda/builds/71239#0198d213-ad64-4483-89a6-bed5a8618976 FAIL 0/21 The test has failed across all retries
WriteCachingFailureInjectionTest test_unavoidable_data_loss null integration https://buildkite.com/redpanda/redpanda/builds/71239#0198d213-ad6b-497b-9caa-32f8cf746895 FLAKY 20/21 upstream reliability is '100.0'. current run reliability is '95.23809523809523'. drift is 4.7619 and the allowed drift is set to 50. The test should PASS
test results on build#71250
test_class test_method test_arguments test_kind job_url test_status passed reason
EndToEndCloudTopicsTxTest test_write null integration https://buildkite.com/redpanda/redpanda/builds/71250#0198d2a8-5fc9-4f6d-b37b-04ce81e20388 FLAKY 19/21 upstream reliability is '92.6056338028169'. current run reliability is '90.47619047619048'. drift is 2.12944 and the allowed drift is set to 50. The test should PASS
DatalakeE2ETests test_json_schema_unicode {"catalog_type": "rest_hadoop", "cloud_storage_type": 1, "query_engine": "trino"} integration https://buildkite.com/redpanda/redpanda/builds/71250#0198d2af-325f-458e-9d4f-c355179eda5f FLAKY 20/21 upstream reliability is '98.39080459770115'. current run reliability is '95.23809523809523'. drift is 3.15271 and the allowed drift is set to 50. The test should PASS

@WillemKauf
Copy link
Contributor Author

/ci-repeat 1
release
skip-redpanda-build
skip-units
dt-repeat=1
tests/rptest/scale_tests/ht_partition_movement_test.py

@WillemKauf
Copy link
Contributor Author

/ci-repeat 1
release
skip-redpanda-build
skip-units
dt-repeat=1
tests/rptest/scale_tests/many_partitions_test.py

@WillemKauf
Copy link
Contributor Author

/ci-repeat 1
release
skip-redpanda-build
skip-units
dt-repeat=1
tests/rptest/scale_tests/partition_balancer_scale_test.py

@WillemKauf
Copy link
Contributor Author

/ci-repeat 1
release
skip-redpanda-build
skip-units
dt-repeat=1
tests/rptest/scale_tests/tiered_storage_io_stress_test.py

nvartolomei
nvartolomei previously approved these changes Aug 21, 2025
Copy link
Contributor

@nvartolomei nvartolomei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mostly questions about dead code; the assert is useful to prevent incorrect use/assumptions

f"Can't wait {self.who_am_i()}. It was already stopped."
f" You can either stop() a service or wait() and then stop() it"
f" but not the other way around.")
assert not self._stopped, f"Can't wait {self.who_am_i()}. It was already stopped. You can either stop() a service or wait() and then stop() it but not the other way around."
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can probably contribute this to ducktape itself

@@ -100,6 +105,7 @@ def test_moving_single_partition_under_load(self, replication_factor):
self._do_move_and_verify(t, partition, 600)

self.verify(topic.name)
self.cleanup()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

verify already called wait(). cleanup is not necessarily needed then, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed

def verify(self, topic_name, msg_size, consumers):
self.producer.wait()

# Await the consumer that is reading only the subset of data that
# was written before it started.
self.consumer.wait()
self.consumer.stop()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be a noop, right? wait encapsulates a stop

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed

@@ -169,3 +174,4 @@ def test_io_stress(self, cloud_storage_type, segment_size,
self._create_initial_topics()

self._workload(segment_size, interval_uploads)
self.cleanup()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto: why do we need this? it is all dead code to me

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed

@vbotbuildovich
Copy link
Collaborator

vbotbuildovich commented Aug 21, 2025

Retry command for Build#71154

please wait until all jobs are finished before running the slash command

/ci-repeat 1
tests/rptest/tests/cluster_features_test.py::FeaturesSingleNodeUpgradeTest.test_upgrade
tests/rptest/tests/license_enforcement_test.py::LicenseEnforcementTest.test_license_enforcement@{"clean_node_after_recovery":false,"clean_node_before_recovery":false}
tests/rptest/tests/license_enforcement_test.py::LicenseEnforcementTest.test_license_enforcement@{"clean_node_after_recovery":true,"clean_node_before_recovery":false}
tests/rptest/tests/cluster_features_test.py::FeaturesMultiNodeUpgradeTest.test_rollback
tests/rptest/tests/cluster_features_test.py::FeaturesNodeJoinTest.test_old_node_join
tests/rptest/tests/license_enforcement_test.py::LicenseEnforcementTest.test_license_enforcement@{"clean_node_after_recovery":true,"clean_node_before_recovery":true}
tests/rptest/tests/cluster_features_test.py::FeaturesMultiNodeUpgradeTest.test_upgrade
tests/rptest/tests/topic_id_migrator_test.py::TopicIDUpgradeTest.test_topic_id_migration
tests/rptest/tests/license_enforcement_test.py::LicenseEnforcementTest.test_license_enforcement@{"clean_node_after_recovery":false,"clean_node_before_recovery":true}

@WillemKauf
Copy link
Contributor Author

Rebased

mmaslankaprv
mmaslankaprv previously approved these changes Aug 22, 2025
@@ -490,7 +492,6 @@ def calculate_total_learners_gap() -> int | None:

self.logger.info(f"decommissioning node: {to_decommission_id}", )
self._decommission(to_decommission_id)
self.producer.wait()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is wait removed from there? this changes what the test does no? previously the test waited for consumer to finish before proceeding. but now?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops. reverted.

Copy link
Contributor Author

@WillemKauf WillemKauf Aug 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assert not self._stopped, f"Can't wait {self.who_am_i()}. It was already stopped. You can either stop() a service or wait() and then stop() it but not the other way around." AssertionError: Can't wait KgoVerifierProducer-0-140548171042240. It was already stopped. You can either stop() a service or wait() and then stop() it but not the other way around.

Oh, that's why I removed it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The second producer.wait() call in verify() will fail with the new assert if this wait() is left here.
It is also unnecessary for the test as written.

nvartolomei
nvartolomei previously approved these changes Aug 22, 2025
@nvartolomei nvartolomei enabled auto-merge August 22, 2025 16:04
@vbotbuildovich
Copy link
Collaborator

Retry command for Build#71239

please wait until all jobs are finished before running the slash command

/ci-repeat 1
tests/rptest/tests/nodes_decommissioning_test.py::NodesDecommissioningTest.test_learner_gap_metrics

The second `producer.wait()` call in `verify()` will fail with the new
assert if this `wait()` is left here.

It is also unnecessary for the test as written.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants