Replies: 2 comments
-
|
Beta Was this translation helpful? Give feedback.
-
Race Condition Analysis: Kademlia DHT Tests🎯 ObjectiveIdentify tests in 🔍 Root Cause PatternThe flaky test issue stems from async race conditions where:
📊 Test Analysis Results✅ Tests Already Fixed
|
Test Function | Risk Level | Primary Issues | Fix Complexity |
---|---|---|---|
test_find_node |
✅ FIXED | Race conditions | ✅ COMPLETED |
test_put_and_get_value |
🔴 HIGH | No retry, manual peer addition | 🟡 MEDIUM |
test_provide_and_find_providers |
🔴 HIGH | No retry, multiple DHT ops | 🟡 MEDIUM |
test_reissue_when_listen_addrs_change |
🟡 MEDIUM | No retry for peer discovery | 🟢 LOW |
test_dht_req_fail_with_invalid_record_transfer |
🟡 MEDIUM | Manual peer addition, DHT ops | 🟡 MEDIUM |
🎯 Next Steps
- Priority 1: Fix
test_put_and_get_value
andtest_provide_and_find_providers
(HIGH RISK) - Priority 2: Fix
test_dht_req_fail_with_invalid_record_transfer
(MEDIUM RISK) - Priority 3: Fix
test_reissue_when_listen_addrs_change
(MEDIUM RISK) - Priority 4: Add comprehensive test coverage for race conditions
🔧 Implementation Strategy
- Apply retry mechanism to all DHT operations in high-risk tests
- Add flaky test markers to tests that may still have timing issues
- Remove manual peer addition and rely on enhanced fixture
- Add proper error handling for DHT operations
- Consider adding timeout configurations for different CI environments
This analysis identifies 3 additional tests that could benefit from the same race condition fixes applied to test_find_node
.
Beta Was this translation helpful? Give feedback.
Uh oh!
There was an error while loading. Please reload this page.
-
@sumanjeet0012 , @acul71 , @yashksaini-coder and @bomanaps : We’ve been seeing intermittent failures in CI/CD for
py-libp2p
, specifically in:AssertionError
This strongly suggests async race conditions / timing sensitivity rather than a core logic regression. GitHub runners are slower and noisier, which makes these issues more likely to show up.
Root Cause
find_node
queries depend on routing table updates and async task completion.🔧 Options for Fixing in CI
xdist
).pytest-rerunfailures
(safety net).Combined Patch (Retry + Flaky Mark)
Here’s a copy-paste ready patch to make
test_find_node
more resilient (please use trio instead of asyncio. I had used asyncio for experiments only as documentation was well available for it):GitHub Actions Change (Optional: Disable
xdist
for kad_dht)In
.github/workflows/tests.yml
:Next Steps
kad_dht
tests serially in CI.find_node
internals (may need explicit bootstrapping or stabilization).We should take the CI-only approach (serial execution) to keep the test code clean at this juncture.
Beta Was this translation helpful? Give feedback.
All reactions