
Eliminate Host.ReportFatalError(), replace by Component Health Reporting #6344

@tigrannajaryan


The Problem

Currently, many components use Host.ReportFatalError() to report problems during startup.

ReportFatalError() is called asynchronously after the component's Start() function returns.

Unfortunately, this creates a problem for anyone who wants to know whether the Collector has started successfully or not. It is currently impossible to know. You may have a Collector that starts all the pipelines, you see clean output in the log with zero error messages, and you assume all is good. However, at an arbitrary time after that, the Collector can fail with a fatal error.

There is no point in time at which you can be sure that you have a running Collector that is not going to crash the next second because some component calls ReportFatalError() whenever it pleases.

This makes the following capability difficult or impossible: knowing whether the configuration that the Collector is using is a good one. This is necessary for the notion of a "last known good config" that we want to use when the Collector is reconfigured by config.Provider watchers.

Proposal

I suggest getting rid of Host.ReportFatalError() altogether.

The Start() function must block until it is certain that the component is up and running.

I looked at our usage of Host.ReportFatalError(). The vast majority of the calls come from failures to start an HTTP Server. This is totally unnecessary. Failures to start an HTTP Server can be reported synchronously from Start().

Of course, the Start() function should not block the startup indefinitely. However, problems that can happen at an unknown time after Start() is invoked are not startup problems. They are health problems. Such problems should not result in failing the entire Collector. They should result in the component reporting that it is unhealthy.
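A rough sketch of what "report unhealthy instead of failing the Collector" could look like follows. Everything here is an assumption made for illustration: the `healthReporter` type, its methods, the `statusHealthy`/`statusUnhealthy` values, and `watchBackend` are hypothetical names, not the proposed component health reporting API.

```go
package main

import (
	"errors"
	"sync"
)

// healthStatus and healthReporter are hypothetical; they only stand in for
// whatever the component health reporting capability ends up providing.
type healthStatus int

const (
	statusHealthy healthStatus = iota
	statusUnhealthy
)

type healthReporter struct {
	mu      sync.Mutex
	status  healthStatus
	lastErr error
}

// ReportUnhealthy records a runtime problem without terminating the process.
func (h *healthReporter) ReportUnhealthy(err error) {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.status, h.lastErr = statusUnhealthy, err
}

// ReportHealthy clears a previously reported problem.
func (h *healthReporter) ReportHealthy() {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.status, h.lastErr = statusHealthy, nil
}

// Status returns the most recently reported state.
func (h *healthReporter) Status() (healthStatus, error) {
	h.mu.Lock()
	defer h.mu.Unlock()
	return h.status, h.lastErr
}

// errBackendDown is a sample error used in the usage note below.
var errBackendDown = errors.New("backend unavailable")

// watchBackend calls poll a fixed number of times and translates each
// result into a health report. A failure marks the component unhealthy;
// a later success marks it healthy again. Nothing here stops the Collector.
func watchBackend(poll func() error, attempts int, h *healthReporter) {
	for i := 0; i < attempts; i++ {
		if err := poll(); err != nil {
			h.ReportUnhealthy(err)
			continue
		}
		h.ReportHealthy()
	}
}
```

In this sketch, a poll that returns `errBackendDown` leaves the reporter in `statusUnhealthy` with that error attached, and the next successful poll flips it back to `statusHealthy`. The component can recover on its own, which a fatal error would not allow.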

My proposal is the following:

  1. Deprecate Host.ReportFatalError().
  2. In all components that call Host.ReportFatalError() to report an HTTP Server start failure, replace the call with a proper synchronous return of the error from Start(). This covers the vast majority of uses.
  3. In components that call Host.ReportFatalError() for other reasons, carefully analyze the usage. If it is clearly a startup failure that can happen within a known, limited time (e.g. 10 seconds), then make sure the operation blocks Start() and return an error from Start(). Otherwise, report it as an error in the log and indicate the component's bad health (using the proposed component health reporting capability).
  4. Change Host.ReportFatalError() to log an error instead of terminating the Collector.
  5. After a grace period, remove Host.ReportFatalError().
  6. Optionally: modify the Collector startup to call Start() functions concurrently, to avoid serializing the blocking operations. We still need to honour the startup sequence of pipelines (exporters->processors->receivers).

Related to #6226
