
Conversation

@alexbrt (Contributor) commented Jul 25, 2025

Azure Application Insights frequently times out, returns HTTP 503 errors, or only accepts partial payloads. This PR adds support for configuring a backon-based backoff strategy to automatically retry telemetry uploads in such cases.
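For context, the retry schedule this PR configures behaves roughly like the following sketch. This is a hypothetical, simplified model (the actual implementation delegates to `backon`): delays start at 500 ms, double on each attempt, and are capped at 5 s. The real strategy also adds jitter, which is omitted here so the schedule is deterministic.

```rust
/// Simplified model of the exponential backoff schedule (jitter omitted):
/// 500 ms, 1 s, 2 s, 4 s, then capped at 5 s for every later attempt.
fn backoff_delays_ms(attempts: u32) -> Vec<u64> {
    let min_ms = 500u64; // delay before the first retry
    let max_ms = 5_000u64; // cap on any single delay
    (0..attempts)
        // double the delay each attempt (shift left), then cap at max_ms
        .map(|i| (min_ms << i.min(16)).min(max_ms))
        .collect()
}
```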

@frigus02 (Owner) left a comment

Thanks a lot for looking into this. This is great.

@frigus02 frigus02 linked an issue Jul 25, 2025 that may be closed by this pull request
@alexbrt alexbrt force-pushed the main branch 2 times, most recently from 995ab29 to 6d833f0 Compare July 30, 2025 10:46
@frigus02 (Owner) commented Aug 4, 2025

Sorry for the slow replies. I was pretty swamped the last few days. I'll try to get to this later today or by Wednesday at the latest.

@frigus02 (Owner) left a comment

I'm happy with this overall. I think a test would be great if possible. Though I'm not sure if the test framework allows that currently. I'm happy to take a stab at that over the weekend.

@frigus02 (Owner) commented Aug 9, 2025

Added tests. I think this is good to merge now. I'll try to release a new version over the weekend.

@frigus02 frigus02 merged commit 68348f9 into frigus02:main Aug 9, 2025
2 checks passed
frigus02 added a commit that referenced this pull request Aug 9, 2025
frigus02 added a commit that referenced this pull request Aug 9, 2025
Azure Application Insights frequently times out, returns HTTP 503
errors, or only accepts partial payloads. This adds transparent retries.
Users of this crate won't have to configure anything.

Using the `backon` crate, because it seems small and easy to use.

Design decisions:

- Don't expose implementation. Add default retries that should work for
  everyone:

  - min_delay = 500ms
  - max_delay = 5s
  - with jitter (to avoid thundering herds when multiple exports run in parallel)

  We wanted no max_times or total delay, because the BatchSpanProcessor
  already provides a `max_export_timeout` option, which automatically
  ensures that the exporter stops. However, that option is only
  available in the experimental async version of the BatchSpanProcessor,
  and it's missing from the SimpleSpanProcessor and from the metric and
  log processors. Therefore this sets a total delay of 35s.

- Add a `with_retry_notify` option. It can be used to log retries and
  thereby debug exporter flakiness.

- Use the `futures-timer` crate to provide sleep functionality during
  retries. Ideally we could use
  opentelemetry_sdk::runtime::Runtime::delay for that, but it's behind
  an experimental feature, and the SDK doesn't provide the exporters
  with a runtime. We'd have to add a configuration option to the
  exporter where users specify their chosen runtime, which would then
  have to match the rest of their setup; otherwise they'd see runtime
  errors.

  E.g. Tokio's sleep, if not executed in the context of a Tokio
  runtime, fails with:

      there is no reactor running, must be called from the context of a Tokio 1.x runtime

  This error would also only surface when a retry actually happens,
  i.e. when a call to AppInsights fails, which would make it very hard
  to debug.

  We're using futures-timer instead, which works with all runtimes,
  including tokio and futures-executor. The OpenTelemetry SDK uses the
  latter for most default processors right now.
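Taken together, these decisions (exponential delays capped at 5 s, a 35 s total-delay budget, and a notify hook) can be sketched as a plain synchronous retry loop. This is a hypothetical illustration, not the crate's actual code, which uses `backon` with async sleeps via `futures-timer`; the `sleep` parameter is injected here so the sketch stays testable without real waiting.

```rust
use std::time::Duration;

/// Hypothetical sketch of the retry behaviour described above:
/// delays start at 500 ms, double up to a 5 s cap, and retries stop
/// once the total sleep time would exceed the 35 s budget. The
/// `notify` hook mirrors the role of `with_retry_notify`.
fn retry_with_notify<T, E>(
    mut op: impl FnMut() -> Result<T, E>,
    mut notify: impl FnMut(&E, Duration),
    mut sleep: impl FnMut(Duration), // injected so tests can skip real sleeping
) -> Result<T, E> {
    let mut delay = Duration::from_millis(500);
    let max_delay = Duration::from_secs(5);
    let budget = Duration::from_secs(35);
    let mut slept = Duration::ZERO;
    loop {
        match op() {
            Ok(v) => return Ok(v),
            Err(e) if slept + delay <= budget => {
                notify(&e, delay); // e.g. log the failure and the upcoming delay
                sleep(delay);
                slept += delay;
                delay = (delay * 2).min(max_delay); // exponential, capped
            }
            Err(e) => return Err(e), // budget exhausted: give up
        }
    }
}
```

With the budget of 35 s and this schedule (0.5 + 1 + 2 + 4 + 5 + 5 + …), the loop allows roughly nine retries before giving up, which matches the intent of bounding the exporter's total runtime even without `max_export_timeout`.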

---------

Co-authored-by: Jan Kuehle <[email protected]>
@frigus02 (Owner) commented Aug 9, 2025

Published with version 0.42.0.

I hope I didn't remove anything you needed in the commits today. If I did, please let me know. And either way, thanks a lot for getting this started! This has been on my todo list for far too long.



Successfully merging this pull request may close these issues.

Retry behaviour
