Flaky test_find_node in kad_dht on GitHub CI (async race/timing issue) #954

seetadev (Maintainer) · 2025-09-24T16:44:00Z

@sumanjeet0012 , @acul71 , @yashksaini-coder and @bomanaps : We’ve been seeing intermittent failures in CI/CD for py-libp2p, specifically in:

tests/core/kad_dht/test_kad_dht.py::test_find_node

Locally: ✅ passes reliably
On GitHub Actions: ❌ fails with AssertionError
All other ~694 tests pass; only this one fails intermittently

This strongly suggests async race conditions / timing sensitivity rather than a core logic regression. GitHub runners are slower and noisier, which makes these issues more likely to show up.

Root Cause

find_node queries depend on routing table updates and async task completion.
CI introduces delays → lookup sometimes fails before routing is ready.
The test is therefore flaky, not deterministically broken.

🔧 Options for Fixing in CI

Retry inside the test → handle transient async delays.
Run DHT tests serially in GitHub Actions (disable xdist).
Mark flaky with reruns using pytest-rerunfailures (safety net).

Combined Patch (Retry + Flaky Mark)

Here’s a copy-paste ready patch to make test_find_node more resilient. Note that it uses asyncio only because I experimented with it (its documentation was readily available); please port it to trio before applying it:

diff --git a/tests/core/kad_dht/test_kad_dht.py b/tests/core/kad_dht/test_kad_dht.py
index abc123..def456 100644
--- a/tests/core/kad_dht/test_kad_dht.py
+++ b/tests/core/kad_dht/test_kad_dht.py
@@
 import pytest
 import asyncio

+
+async def retry(coro, retries=3, delay=0.5):
+    """
+    Retry a coroutine a few times to avoid flakiness on CI.
+    Useful for async race conditions where routing tables
+    may not be populated fast enough.
+    """
+    for i in range(retries):
+        try:
+            return await coro()
+        except AssertionError:
+            if i == retries - 1:
+                raise
+            await asyncio.sleep(delay)
+
 
-@pytest.mark.asyncio
-async def test_find_node(setup_dht_nodes):
-    node, target_id = setup_dht_nodes
-    result = await node.find_node(target_id)
-    assert target_id in result
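+@pytest.mark.asyncio
+@pytest.mark.flaky(reruns=3, reruns_delay=1)
+async def test_find_node(setup_dht_nodes):
+    node, target_id = setup_dht_nodes
+
+    async def lookup():
+        # Assert inside the retried callable: retry() only retries when
+        # the callable itself raises AssertionError.
+        result = await node.find_node(target_id)
+        assert target_id in result
+
+    await retry(lookup)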

GitHub Actions Change (Optional: Disable `xdist` for kad_dht)

In .github/workflows/tests.yml:

jobs:
  tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.12'
      - name: Install dependencies
        run: pip install -e .[dev] pytest-rerunfailures
      - name: Run core tests (with xdist)
        run: pytest tests/core --maxfail=1 -q -n auto -p no:warnings --disable-warnings --deselect=tests/core/kad_dht
      - name: Run kad_dht tests serially
        run: pytest tests/core/kad_dht --maxfail=1 -q --disable-warnings -p no:xdist

Next Steps

Try the combined patch (retry + flaky decorator) first.
If still flaky: run kad_dht tests serially in CI.
If still flaky after that: revisit find_node internals (may need explicit bootstrapping or stabilization).

That said, for now we should prefer the CI-only approach (serial execution), since it keeps the test code clean.

acul71 · 2025-09-24T21:39:53Z

Opened Issue in #956
Submitting fix in #957

acul71 · 2025-09-24T21:45:33Z

Race Condition Analysis: Kademlia DHT Tests

🎯 Objective

Identify tests in test_kad_dht.py that could fail for the same race condition reasons as test_find_node, and provide recommendations for fixes.

🔍 Root Cause Pattern

The flaky test issue stems from async race conditions where:

DHT nodes are not fully initialized before test execution
Routing tables are not populated with peer information
Peer discovery operations depend on timing-sensitive network operations
Signed peer records may not be exchanged properly between nodes

📊 Test Analysis Results

✅ Tests Already Fixed

| Test Function | Status | Race Condition Risk |
| --- | --- | --- |
| `test_find_node` | FIXED | ✅ Eliminated with enhanced `dht_pair` fixture |

⚠️ Tests at HIGH Risk of Race Conditions

1. `test_put_and_get_value` - HIGH RISK ✅ VERIFIED

Race Condition Indicators:

Line 197: await dht_a.routing_table.add_peer(peer_b_info) - Manual peer addition
Line 213: await dht_a.put_value(key, value) - DHT operation without retry
Line 248: await dht_b.get_value(key) - DHT operation without retry
Line 240: await trio.sleep(0.5) - Arbitrary sleep suggests timing sensitivity

Potential Issues:

PUT_VALUE operation may fail if routing table isn't fully populated
GET_VALUE operation may fail if value propagation is incomplete
No retry mechanism for DHT operations
Relies on manual peer addition instead of using enhanced fixture

Recommended Fixes:

# Add retry mechanism (pass a callable so each attempt creates a fresh coroutine)
with trio.fail_after(TEST_TIMEOUT):
    await retry(lambda: dht_a.put_value(key, value))

with trio.fail_after(TEST_TIMEOUT):
    retrieved_value = await retry(lambda: dht_b.get_value(key))

2. `test_provide_and_find_providers` - HIGH RISK ✅ VERIFIED

Race Condition Indicators:

Line 301: await dht_a.provide(content_id) - Provider advertisement without retry
Line 331: await dht_b.find_providers(content_id) - Provider discovery without retry
Line 327: await trio.sleep(0.1) - Arbitrary sleep suggests timing sensitivity
Line 361: await dht_b.get_value(content_id) - Value retrieval without retry

Potential Issues:

Provider advertisement may fail if nodes aren't properly connected
Provider discovery may fail if advertisement hasn't propagated
No retry mechanism for DHT operations
Multiple sequential DHT operations without proper error handling

Recommended Fixes:

# Add retry mechanism for all DHT operations (pass callables, not coroutine objects)
with trio.fail_after(TEST_TIMEOUT):
    success = await retry(lambda: dht_a.provide(content_id))

with trio.fail_after(TEST_TIMEOUT):
    providers = await retry(lambda: dht_b.find_providers(content_id))

with trio.fail_after(TEST_TIMEOUT):
    retrieved_value = await retry(lambda: dht_b.get_value(content_id))

⚠️ Tests at MEDIUM Risk

3. `test_reissue_when_listen_addrs_change` - MEDIUM RISK ✅ VERIFIED

Race Condition Indicators:

Line 429: await dht_a.find_peer(dht_b.host.get_id()) - Peer discovery without retry
Line 442: await dht_a.peer_routing._query_peer_for_closest(...) - Internal DHT operation

Potential Issues:

Initial peer discovery may fail if routing table isn't populated
Internal DHT query may fail if peer connection isn't established

Recommended Fixes:

# Add retry mechanism for peer discovery (pass a callable, not a coroutine object)
with trio.fail_after(10):
    await retry(lambda: dht_a.find_peer(dht_b.host.get_id()))

4. `test_dht_req_fail_with_invalid_record_transfer` - MEDIUM RISK ✅ VERIFIED

Race Condition Indicators:

Line 474: await dht_a.routing_table.add_peer(peer_b_info) - Manual peer addition
Line 486: await dht_a.put_value(key, value) - DHT operation without retry
Line 499: await dht_a.put_value(key, value) - DHT operation without retry

Why Medium Risk (not Low):

Contains manual peer addition and DHT operations without retry
Tests failure scenarios but still performs DHT operations
May fail due to race conditions in the setup phase

🛠️ Recommended Fixes

Immediate Actions Required:

Add Retry Mechanism to High-Risk Tests:

# Apply to all DHT operations in high-risk tests
# (pass the callable itself so retry() can call it again on each attempt)
await retry(dht_operation)

Add Flaky Test Markers:

@pytest.mark.flaky(reruns=3, reruns_delay=1)

Remove Manual Peer Addition:
- Remove await dht_a.routing_table.add_peer(peer_b_info) calls
- Rely on enhanced dht_pair fixture for peer discovery

Replace Arbitrary Sleeps:
- Remove await trio.sleep(0.5) and similar calls
- Use proper retry mechanisms instead

Files to Modify:

tests/core/kad_dht/test_kad_dht.py - Add retry mechanisms to high-risk tests
pyproject.toml - Already updated with pytest-rerunfailures

📈 Risk Assessment Summary

| Test Function | Risk Level | Primary Issues | Fix Complexity |
| --- | --- | --- | --- |
| `test_find_node` | ✅ FIXED | Race conditions | ✅ COMPLETED |
| `test_put_and_get_value` | 🔴 HIGH | No retry, manual peer addition | 🟡 MEDIUM |
| `test_provide_and_find_providers` | 🔴 HIGH | No retry, multiple DHT ops | 🟡 MEDIUM |
| `test_reissue_when_listen_addrs_change` | 🟡 MEDIUM | No retry for peer discovery | 🟢 LOW |
| `test_dht_req_fail_with_invalid_record_transfer` | 🟡 MEDIUM | Manual peer addition, DHT ops | 🟡 MEDIUM |

🎯 Next Steps

Priority 1: Fix test_put_and_get_value and test_provide_and_find_providers (HIGH RISK)
Priority 2: Fix test_dht_req_fail_with_invalid_record_transfer (MEDIUM RISK)
Priority 3: Fix test_reissue_when_listen_addrs_change (MEDIUM RISK)
Priority 4: Add comprehensive test coverage for race conditions

🔧 Implementation Strategy

Apply retry mechanism to all DHT operations in high-risk tests
Add flaky test markers to tests that may still have timing issues
Remove manual peer addition and rely on enhanced fixture
Add proper error handling for DHT operations
Consider adding timeout configurations for different CI environments
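For the last point, one option is to scale test timeouts when running on CI. GitHub Actions sets the `CI` environment variable to `true` by default; everything else in this sketch is an assumption made for illustration, including the `TEST_TIMEOUT_FACTOR` override, the default factor of 3, and the 10 second base value.

```python
import os

BASE_TIMEOUT = 10.0  # seconds; illustrative value, not taken from the test suite


def scaled_timeout(base=BASE_TIMEOUT, env=None):
    """Return `base` multiplied by a factor that is larger on CI runners."""
    env = os.environ if env is None else env
    # Hypothetical override variable; falls back to 3x on CI and 1x locally.
    default_factor = "3" if env.get("CI") else "1"
    factor = float(env.get("TEST_TIMEOUT_FACTOR", default_factor))
    return base * factor


# Could then be used wherever the tests pass a timeout to trio.fail_after(...)
TEST_TIMEOUT = scaled_timeout()
```

Keeping the factor in one place means a slow runner can be accommodated without editing individual tests.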

This analysis identifies 4 additional tests that could benefit from the same race condition fixes applied to test_find_node.
