@benbenwilde
Contributor

Description

You can now use ActorSystemConfig.WithBlockedMemberDuration() to configure how long a member remains in the block list after being blocked. The default is 1 hour, which is the same value as before.
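For illustration, shortening the block duration might look like the fragment below. The method name comes from this description. The TimeSpan parameter and the five-minute value are assumptions made for the example, so check the actual signature before relying on it.

```csharp
using System;
using Proto;

// Assumption: WithBlockedMemberDuration accepts a TimeSpan.
// The five-minute value is only an example; the default stays at 1 hour.
var config = ActorSystemConfig
    .Setup()
    .WithBlockedMemberDuration(TimeSpan.FromMinutes(5));

var system = new ActorSystem(config);
```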

Changed the behavior of BlockList.IsBlocked(): previously, a blocked member was considered blocked forever. Now it uses the same logic as BlockList.BlockedMembers, which I think is what callers would expect.
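The idea can be sketched with a small stand-in class. This is not the Proto.Cluster BlockList source: ExpiringBlockList, its constructor, and the injected clock are hypothetical names used only to show IsBlocked and BlockedMembers sharing one expiry rule.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for illustration; not the real BlockList implementation.
public sealed class ExpiringBlockList
{
    // Member id -> UTC time at which the member was blocked.
    private readonly ConcurrentDictionary<string, DateTime> _blockedAt =
        new ConcurrentDictionary<string, DateTime>();
    private readonly TimeSpan _blockedMemberDuration;
    private readonly Func<DateTime> _utcNow;

    // The clock is injected so that expiry can be exercised without waiting.
    public ExpiringBlockList(TimeSpan blockedMemberDuration, Func<DateTime> utcNow)
    {
        _blockedMemberDuration = blockedMemberDuration;
        _utcNow = utcNow;
    }

    public void Block(string memberId) => _blockedAt[memberId] = _utcNow();

    // Single expiry rule used by both public views.
    private bool IsStillBlocked(DateTime blockedAt) =>
        _utcNow() - blockedAt < _blockedMemberDuration;

    public IReadOnlyCollection<string> BlockedMembers =>
        _blockedAt.Where(kv => IsStillBlocked(kv.Value)).Select(kv => kv.Key).ToList();

    public bool IsBlocked(string memberId) =>
        _blockedAt.TryGetValue(memberId, out var blockedAt) && IsStillBlocked(blockedAt);
}
```

Because both members go through IsStillBlocked, a member that has dropped out of BlockedMembers can no longer be reported as blocked by IsBlocked.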

Lastly, when a member left, it was removed from every place it had been added to when it originally joined, except the _metaMembers dictionary. Now it is removed from there as well. Before this change, a member could never successfully rejoin.
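A minimal sketch of why symmetric cleanup matters follows. It assumes the join path rejects a member that still has an entry in any tracking dictionary, which is one plausible mechanism and is not confirmed by this description. MemberRegistrySketch and its fields are hypothetical and do not mirror the real MemberList.

```csharp
using System.Collections.Generic;

// Hypothetical illustration only; not the Proto.Cluster MemberList.
public sealed class MemberRegistrySketch
{
    private readonly Dictionary<string, string> _members = new Dictionary<string, string>();     // id -> address
    private readonly Dictionary<string, string> _metaMembers = new Dictionary<string, string>(); // id -> metadata

    public bool TryJoin(string id, string address, string metadata)
    {
        // Assumed rule: a leftover entry in either dictionary makes the
        // member look like it is already present, so the join is rejected.
        if (_members.ContainsKey(id) || _metaMembers.ContainsKey(id))
        {
            return false;
        }

        _members[id] = address;
        _metaMembers[id] = metadata;
        return true;
    }

    public void Leave(string id)
    {
        _members.Remove(id);
        // Removing the metadata entry as well keeps join and leave symmetric,
        // so a later TryJoin for the same id can succeed.
        _metaMembers.Remove(id);
    }
}
```

Under that assumption, dropping the second Remove call in Leave would cause every later TryJoin for the same id to return false.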

Background:
We had an issue with our cluster provider implementation, which momentarily provided invalid data (this is being fixed separately). Nonetheless, once it recovered, any wrongfully removed members could never be re-added. This left the cluster in a bad state: the cluster provider was supplying a list of members, but they were never added back to the member list, which caused a number of issues. Now members can rejoin, and the block duration can be configured to be less than an hour.

Purpose

This pull request is a:

  • Bugfix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

Checklist

  • I have added tests that prove my fix is effective or that my feature works
  • I have added necessary documentation (if appropriate)

…nto account. When member leaves, remove everywhere.
@rogeralsing rogeralsing merged commit b4d8093 into asynkron:dev Mar 6, 2025
18 checks passed
rogeralsing added a commit that referenced this pull request Oct 11, 2025
* Eventstream channels (#2101)

* subscribe to eventstream using typed channel

* assert publish

* rename

* debug cluster tests (#2104)

Rearrange cluster shutdown code

* fix actor fails to receive batching messages if batchsize is less than the number of messages in the batchingMailbox because of dead loop (#2105)

* Fix so if _oldest and _cleanedate equals it will still perform the cleaning in Deduplicator (#2106)

* Added support for propagating OpenTelemetry baggage (#2107)

* Added support for propagating OpenTelemetry baggage

* Updated tests, only set baggage if it's in use

* Mssqlprovider improvements (#2108)

* Switch from System.Data.SqlClient to Microsoft.Data.SqlClient per official recommendation

* use await on using

* feat: 2110 memory_leak_bugfix (#2111)

* With simpler execution plan by avoiding unnecessary sorting of all of an actor's snapshots (#2115)

* using await on types implementing IAsyncDisposable (#2114)

* Activity source (#2119)

* Track Spawning
* Track EventStream Publish
* Track EventStream Subscription

* Graceful stop (#2121)

* graceful stop of cluster actors

* Downgrade this log to Warning (#2124)

* caching the CancellationToken will prevent race condition (#2125)

To be a good citizen, it is important that the caller of this method can (and should) call Dispose() on the returned token source. The problem though, is that each time cts.Token was called before, it would try to create a new CancellationToken which means that it could throw an ObjectDisposed exception leading to noisy logs.

* Support dotnet 8 (#2126)

* support dotnet8 target

* make sure all the right versions are installed for dotnet 8 testing

* No need for pre-release since it is no longer an argument

* Try with the newest version of the action

* . (#2120)

* Expose some internal properties as public (#2127)

In order to write a proper alternative to the DefaultClusterContext, we need these properties to be public

* update copyrights (#2131)

* update copyrights

* add receive timeout (#2132)

* add receive timeout

* Pass cancellation token to grpc reader (#2133)

* Dependabot lite (#2134)

* Upgraded dependencies

* Dependency upgrades & cleanup

* Centralized package management (#2135)

* Migrated to centralized package management

* Made the persistence tests ~90 times faster by using fixtures for the external DB's

* Version bumps

* otel decorator should implement all ReenterAfter methods + add tests for baggage on ReenterAfter (#2138)

* Fix - Under load and during topology changes, thread saturation can occur, causing a lockup (#2139)

* Add endpoint manager test to repro thread lockup

fix merge

* add explanation and dockerfile

* Block endpoint while it disposes instead of holding requests behind a lock. this also allows messages to other endpoints while one is disposing, as well as multiple endpoint disposes at the same time.

* Change EndpointManager Stop to StopAsync

* ensure the EndpointReader always finishes up the request after sending the DisconnectRequest, so it doesn't time out during kestrel shutdown

* increase timeout on a couple tests so they fail less often

* increase another timeout on flakey test

* Ignore OperationCanceledException in SafeTask (#2140)

* Support adding additional error handling to generated actors (#2141)

* Support adding additional error handling to generated actors

updates for tests

* add delay for flaky test

* Fix issue reconnecting to a cluster client or member. Bug was introduced in 1.7.1.alpha-0.4 build. (#2142)

* Otel Updates (#2144)

* Bump OTEL packages

* Moved spawn trace activity metadata from name to tags

* Bump grpc / protobuf versions

* After an actor is stopped, block continuations from running (#2146)

* After an actor is stopped, block continuations from running

* Additionally check the actors cancellation token before sending the Continuation

* Fix for case where a TopicActor can continue sending messages to a member that no longer exists. (#2147)

* Add try catch to cluster shutdown, so if there are any errors, anything that is waiting for shutdown will still see it (#2156)

* Add GrainException to support better error handling from generated grains (#2157)

* Add GrainException which enables throwing a GrainException back on the grain client with a specific message and code from inside a grain message handler. This enables application code to easily provide different error handling behavior based on the code.

* add documentation to GrainException

* Blocked member duration configurable. IsBlocked takes this duration into account. When member leaves, remove everywhere. (#2158)

* update nugets (#2163)

update nugets

* update nugets (#2164)

* Implemented etcd IClusterProvider (#2150)

* Implemented etcd IClusterProvider

* Remove JsonSerializerOptions from EtcdProviderConfig and update serialization methods in EtcdProvider

---------

Co-authored-by: Tyrone Groves <[email protected]>

* fix build (#2165)

* Fix spelling of 'occurred' in cluster logs (#2170)

* Fix typo in event sourcing exception summary (#2171)

* refactor: replace Thread.Sleep with async wait (#2172)

* fix typos in identity project (#2173)

* test: strengthen actor registry tests (#2174)

* Wrapping Task.Delay with an implementation that uses TimeProvider in … (#2149)

* Wrapping Task.Delay with an implementation that uses TimeProvider in .NET 8 or higher

* updated TimerExtensions as well

* Testing Scheduler.SendOnce with TimeProvider

* fixed build due to missing PackageVersion

* Moved version of dotnet-etcd to central package management

---------

Co-authored-by: Roger Johansson <[email protected]>

* docs: add codebase overview (#2175)

* docs: expand example README explanations (#2176)

* Update saga docs for async delay (#2177)

* docs: fix spelling of synonym in Terminology.md (#2178)

* remote: replace async void Disconnect with Task-returning method (#2179)

* Update ActorSystemExtensions.cs

* docs: fix terminology for Proto.Cluster (#2182)

* fix: correct header file reference (#2183)

* chore: update copyright headers (#2184)

* Cleanup actor counts for departed members (#2186)

* Add OpenTelemetry tracing example (#2187)

* docs: add TLS certificate generation instructions (#2189)

* Add ScheduledMessages example (#2188)

* test: remove arbitrary delays in ClusterTests for determinism (#2180)

* Improve mailbox scheduling test assertion (#2191)

* Add unwatch termination test to WatchTests (#2192)

* test: ensure unwatch prevents terminated

* Fix flakey stop assertion in supervision test

* test: verify scheduler cancel repeated sends (#2194)

* test: verify receive timeout resets on activity (#2195)

* test: add routing coverage for pool routers (#2197)

* test: add missing pool router coverage (#2198)

* test: extend scheduler coverage (#2196)

* test: expand process registry coverage (#2199)

* test: cover always restart supervision (#2201)

* test: cover supervision restart scenarios (#2203)

* chore: remove unneeded tests (#2204)

* Stabilize actor metrics and supervision tests (#2205)

* Return early on duplicate cluster activation requests (#2206)

* Enable NuGet caching for workflows (#2207)

* chore: target only .NET 8 (#2208)

* fix duplicate dotnet-etcd package entry (#2209)

* chore: revert Couchbase upgrade (#2210)

* feat: cache build outputs (#2212)

* chore: reduce workflow fetch depth (#2215)

* Make coverage optional (#2218)

* Only produce symbol packages in CI (#2217)

* Fix BuildStartedActivity overload (#2211)

* chore: modernize example programs (#2219)

* docs: summarize eventstream interactions (#2220)

* docs: summarize eventstream interactions

* docs: fix EventStream file links

* docs: document membership gossip flow (#2221)

* docs: document membership gossip flow

* Update CLUSTER_MEMBERSHIP_GOSSIP.md

* docs: link root documentation (#2222)

* Escalate unknown system messages (#2223)

* test: cover actor implementing supervisor strategy (#2224)

* refactor: streamline gossip callbacks (#2225)

* refactor: streamline gossip callbacks

* refactor: purge on topology update

* test: broaden gossip consensus coverage

* Add cluster gossip example (#2226)

* test: add graceful leave blocklist test (#2227)

* test(cluster): add gossip request validation tests (#2228)

* test(cluster): add gossip request validation tests

* test(cluster): remove reflection for gossip block

* test: document MemberStateDelta bug in redundant gossip (#2229)

* test: simulate gossip partition without blocklist (#2231)

* Add regression test for unreachable subscribers (#2233)

* Remove Respond logging test (#2234)

* refactor: remove grpc core support (#2236)

* refactor: remove grpc core support

* Update EndpointManager.cs

* feat: unify remote configuration (#2237)

* feat: unify remote config

* chore: update samples for unified remote config

* Test gossip replication with real cluster (#2238)

* test: add large message envelope test (#2239)

* Move message batch creation to readers (#2240)

* fix: restore required usings (#2241)

* fix: tighten consensus check nullability (#2243)

* Use UsingSnapshotting property during recovery (#2244)

* Memoize Rendezvous member filtering (#2245)

* Use context cancellation tokens in topic actor (#2246)

* Use configurable activation timeout (#2247)

* Use typed logging across core libraries (#2248)

* Replace raw log statements in Proto.Actor (#2249)

* Add typed logging partial classes for Proto.Actor

* fix supervision log method ambiguity

* chore: add generated logging declarations (#2250)

* test: add guardian process tests (#2252)

* Add tests for TimerExtensions scheduler (#2251)

* test: add stashing tests (#2253)

* Remove obsolete Http2UnencryptedSupport switches (#2254)

* Replace custom WaitUpTo with Task.WaitAsync (#2255)

* Replace custom WaitUpTo with Task.WaitAsync

* Add missing using for Retry helper

* Use Task.Run and unwrap scheduler tasks (#2256)

* Use asynchronous continuations for task completion (#2257)

* Use shared or seeded Random (#2260)

* Remove System.ValueTuple package (#2259)

* Use Thread.Yield for non-blocking MPMCQueue backoff (#2258)

* Replace blocking Result with awaits (#2261)

* Avoid channel closed exception when stopping batching producer (#2262)

* docs: add agents instructions (#2263)

* docs: add agents instructions

* Update AGENTS.md

* Update AGENTS.md

* Update AGENTS.md

* ci: run tests on dev branch manually (#2264)

* ci: run tests on dev branch manually

* Potential fix for code scanning alert no. 10: Workflow does not contain permissions

Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>

---------

Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>

* test: verify custom props middleware (#2266)

* Track activation requests in partition identity tests (#2267)

* refactor: rename activation request sent metric (#2268)

* Track forwarded activation metrics in partition identity tests (#2269)

* Include activation stats in PartitionIdentity test failures (#2270)

* Include actor start count in activation stats (#2274)

* chore: simplify supervision log messages (#2275)

* unobsolete logger (#2276)

* Reorganize examples by core library (#2280)

* refactor: unify ReenterAfter with generic helpers (#2281)

* refactor: unify ReenterAfter with generic helpers

* feat: provide convenience ReenterAfter overloads

* test: ensure ReenterAfter extensions hit decorators

* refactor: relocate config and context (#2282)

* refactor: rename Member directory to Membership (#2283)

* Move mailbox interface and factories to separate files (#2284)

* Add TestKit example (#2285)

* Use TotalSeconds in TestProbe messages (#2287)

* Document TraceLens setup (#2286)

* feat: add probes as middleware (#2290)

* test: drop flaky TestMailbox (#2291)

* Introduce ProbeMailboxStatistics and system message helpers (#2292)

* test: use TestProbe for supervision restarts (#2293)

* feat(testkit): move mailbox stats and simplify test probe (#2294)

* Restore timeout messages and tidy wait helpers (#2295)

* Ensure router removal processed before routing (#2297)

* Add ISchedulerHook interface (#2296)

* test: probe-based partition consensus (#2298)

* Ensure routee mailbox drained after removal in router tests (#2299)

* Revert "test: probe-based partition consensus (#2298)" (#2301)

This reverts commit 50777df.

* refactor scheduler tests to use testprobe (#2302)

* refactor: use TestProbe for receive timeout tests (#2304)

* Use TestProbe in timer tests (#2303)

* refactor watch tests to use test probe (#2306)

* Replace watcher actors with TestProbe (#2305)

* Use TestProbe for RequestRepeatedly cancellation test (#2307)

* testkit: add ActorSystem.CreateTestProbe (#2310)

* Replace MyTestActor with TestProbe in broadcast router tests (#2309)

* Replace MyTestActor with TestProbe in broadcast router tests

* Assert both messages for remaining broadcast routees

* Use mailbox stats to await stop in reenter test (#2308)

* Add async helper and extension for TestMailboxStats

* Cancel reenter continuation after awaiting stop

* refactor: use test helpers for probes and mailbox stats (#2311)

* test: address review comments

* test: include mailbox namespace

* test: stabilize core actor tests

* test: await probe start in remote watch tests

* Add predicates for system message expectations (#2312)

* refactor(testkit): split expect/get helpers and refine mailbox check (#2313)

* Use ExpectNextUserMessageAsync in UnknownSystemMessageTests (#2315)

* Use probe to await child restart (#2314)

* Replace actor helpers with test probes (#2316)

* docs: clarify testing guidelines (#2317)

* Update AGENTS.md (#2318)

* test: use probe for concurrent message ack (#2319)

* Use test-kit helpers for no-delivery checks (#2320)

* Update AGENTS.md (#2323)

* refactor: use TestProbe in FunctionActor test (#2324)

* test: await mailbox statistics instead of fixed delays (#2325)

* Use AwaitConditionAsync in mailbox scheduling tests (#2326)

* test: ensure resumed actors report empty mailbox (#2327)

* feat(testkit): wait for probe context (#2330)

* Add change log (#2331)

* test: streamline router tests with ExpectEmptyMailbox (#2332)

* test: await mailbox escalation (#2334)

* docs: focus testkit docs (#2333)

* chore: add change log (#2335)

* add log for actor lifecycle tests (#2336)

* test: use ExpectEmptyMailbox in watch test (#2338)

* chore: add change log (#2339)

* Ensure block list is updated before gossip in graceful leave test (#2341)

* test: use ExpectEmptyMailbox in watch test (#2337)

* test: await pid cache eviction (#2340)

* Extract member state delta builder (#2342)

* chore: add change log (#2343)

* docs: add change log (#2344)

* Use CancellationTokens helper in cluster tests (#2346)

* refactor MemberList topology update (#2347)

* test topology diff

* extract topology builder

* test: expand topology builder scenarios

* Enhance AGENTS.md with setup and coding guidelines

Updated agent instructions to include setup, coding guidelines, refactoring, and testing practices.

* refactor: use shared spawner helper in actor tests (#2345)

* refactor: use shared spawner helper in actor tests

* refactor: inline actor spawns in tests

* chore: add change log (#2348)

* Add tests for retry cancellation (#2350)

* chore: add change log (#2351)

* document deterministic delta helper (#2352)

* docs: document consensus checks (#2353)

* feat(gossip): add deterministic random provider (#2355)

* test: verify MergeStates immutability (#2354)

* Add gossip transport abstraction (#2356)

* feat(gossip): inject member and block list dependencies (#2357)

* Revise coding guidelines for logging practices

* feat: inject gossip config (#2358)

* fix: initialize gossip after cluster services (#2361)

* docs: fix TestProbe example (#2362)

* docs: log cluster test investigation (#2363)

* test: verify gossiper across remote nodes (#2364)

* use concurrent queue for test mailbox stats (#2365)

* chore: add work log (#2366)

* Fix Gossip composite consensus test timing (#2367)

* Fix Gossip composite consensus test timing

* Extract topology consensus wait helper

* refactor: dispose fixtures directly

* test: cover ExpectUpdatedTopologyConsensus

* docs: document gossip design and testkit (#2370)

* testkit: add ExpectMemberToExist helper (#2369)

* Add log for duplicate member fix (#2371)

* chore: note flaky supervision test (#2372)

* docs: clarify gossip type relationships (#2373)

* docs: clarify gossip membership flow (#2374)

* Refactor MemberList into managers (#2375)

* refactor consensus logging helper (#2377)

* Refactor blocking of gracefully left members (#2378)

* Use ToImmutableDictionary in Gossiper.GetState (#2379)

* refactor: clarify variable naming (#2380)

* chore: fix roslynator warnings (#2381)

* Fix nullability warnings in core components (#2383)

* docs: add log for dependency fixes (#2384)

* refactor gossip gossiper (#2385)

* refactor gossip gossiper

* chore: log builder extraction

* test: consolidate gossip tests into cluster suite

* refactor: modularize cluster diagnostics and init (#2386)

* refactor: extract connection runner (#2387)

* chore: log StopActorWithLongRunningTask fix (#2388)

* Add typed log messages for ConnectionRunner (#2390)

* Update AGENTS.md (#2391)

* Add typed log messages for ConnectionRunner (#2389)

* refactor: split endpoint reader and writer (#2392)

* chore: log receive timeout test update (#2393)

* refactor: extract partition helpers (#2394)

* refactor remote stream processing (#2395)

* add remote stream processor tests and log

* docs: log reader completion fix

* docs: add endpoint documentation (#2396)

* chore: document endpoint renames (#2397)

* add diagram (#2398)

* Fix typo in README.md regarding mailbox processing (#2399)

* Allow nullable sender in message envelope (#2277)

* Update KubernetesClient package to version 17.0.14 (#2403)

Fixes a vulnerability issue: https://avd.aquasec.com/nvd/2025/cve-2025-9708/

* Add AI-oriented context guides (#2404)

* auto context

* .

* .

* clean up remote tests (#2405)

* Refine shared future wrap-around regression test (#2407)

* Remove net5 artifacts from remaining projects (#2408)

* docs: document dotnet coverage workflow (#2409)

* Added optional ClusterIdentity parameter to IMemberStrategy.GetActivator (#2410)

* Added optional ClusterIdentity parameter to IMemberStrategy.GetActivator

* Refactor / cleanup / tag obsolete

---------

Co-authored-by: wuuer <[email protected]>
Co-authored-by: Niclas Pehrsson <[email protected]>
Co-authored-by: Magne Helleborg <[email protected]>
Co-authored-by: Amir Shitrit <[email protected]>
Co-authored-by: Vlad Dev <[email protected]>
Co-authored-by: Justin LeFebvre <[email protected]>
Co-authored-by: Ben Wilde <[email protected]>
Co-authored-by: Fraye <[email protected]>
Co-authored-by: mugu-1 <[email protected]>
Co-authored-by: Tyrone Groves <[email protected]>
Co-authored-by: Tyrone Groves <[email protected]>
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
Co-authored-by: Sipke Schoorstra <[email protected]>
