-
Notifications
You must be signed in to change notification settings - Fork 3.9k
xds: xdsClient caches transient error for new watchers #12262
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
xds: xdsClient caches transient error for new watchers #12262
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With this change, I really want to kill errorDescription
, as it is just lastError = Status.INVALID_ARGUMENT
. However, some of the fallback code is using errorDescription in really weird ways, so we'd have to clean that up first.
@@ -676,6 +676,8 @@ private final class ResourceSubscriber<T extends ResourceUpdate> { | |||
private ResourceMetadata metadata; | |||
@Nullable | |||
private String errorDescription; | |||
@Nullable | |||
private Status lastError; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI: Change data to StatusOr<T> data
would have kept this a bit more clear. You still have data vs absent, but at least you don't have yet another thing to deal with.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm concerned that it's a more invasive refactoring change since it would require touching every method that uses the data field. I can try to do it in a follow up PR and keep this PR's scope limited to bug fix?
I'll agree that we are moving to more usage of StatusOr<T>
in our codebase everywhere. We should consider here as well for consistency.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's only a few other places that access data
. It does look like it'd make all of them more complicated. Most are of the form data != null
, and would likely need to become data != null && data.hasValue()
.
@@ -676,6 +676,8 @@ private final class ResourceSubscriber<T extends ResourceUpdate> { | |||
private ResourceMetadata metadata; | |||
@Nullable | |||
private String errorDescription; | |||
@Nullable | |||
private Status lastError; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unit tests?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unit tests are needed for this? I don't know how I replicate this behaviour. The logic sounds well, we should probably let it merge and see?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unit tests prevent breakages in the future from new code changes.
There are many tests that test onData and onError and testing watcher calls.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A unit test shouldn't be too hard. We find a test that is triggering onError()
, and then after getting the onError() add a second subscription.
So as "the first one I noticed" there's ldsResourceUpdated_withXdstpResourceName_withUnknownAuthority().
grpc-java/xds/src/test/java/io/grpc/xds/GrpcXdsClientImplTestBase.java
Lines 680 to 695 in 6462ef9
@Test | |
public void ldsResourceUpdated_withXdstpResourceName_withUnknownAuthority() { | |
String ldsResourceName = | |
"xdstp://unknown.example.com/envoy.config.listener.v3.Listener/listener1"; | |
xdsClient.watchXdsResource(XdsListenerResource.getInstance(), ldsResourceName, | |
ldsResourceWatcher); | |
verify(ldsResourceWatcher).onError(errorCaptor.capture()); | |
Status error = errorCaptor.getValue(); | |
assertThat(error.getCode()).isEqualTo(Code.INVALID_ARGUMENT); | |
assertThat(error.getDescription()).isEqualTo( | |
"Wrong configuration: xds server does not exist for resource " + ldsResourceName); | |
assertThat(resourceDiscoveryCalls.poll()).isNull(); | |
xdsClient.cancelXdsResourceWatch(XdsListenerResource.getInstance(), ldsResourceName, | |
ldsResourceWatcher); | |
assertThat(resourceDiscoveryCalls.poll()).isNull(); | |
} |
You could create a ldsResourceWatcher2
and subscribe to the same resource after the verify(ldsResourceWatcher).onError(errorCaptor.capture());
, and expect verify(ldsResourceWatcher2).onError(any());
. You can also delete unimportant parts of the test, like checking the status description. You might want to verify that the second watcher gets the same Status
, although that could even be optional.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not confident with this unit test since we are not checking the instance to be same. It seems to be getting caught in errorDescription != null
(code link) for both the watchers and not checking the savedError.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR introduces lastError
, so there's no other test right now for setting lastError.
I tried moving check for savedError before errorDescription but looks like lastError is not getting set.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your PR introduces lastError
but the existing control paths in other tests do go there and set it. Just set breakpoint and run all tests in that test class and you will see.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like none of the tests are stopping at that break point from GrpcXdsClientImplTestBase class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh I was keeping the break point inside onError where savedError was getting defined. My bad. I made a new commit. PTAL.
…ort) (#12291) Backport of #12262 to v1.75.x. --- Fixes #11672 Co-authored-by: MV Shiva <[email protected]>
Fixes #11672