Skip to content

Make latency impact of decision logs predictable #5724

@mjungsbluth

Description

@mjungsbluth

What is the underlying problem you're trying to solve?

The current decision log plugin runs in two variants: unbounded (all decisions are kept) and bound (decisions are discarded if the buffer has an overflow). The actual trade-off is between auditability and availability. However, in the first unbounded case, the likelihood of an OOM kill actually grows if the decision log API gets overloaded, still loosing decisions.

On top, there are quite heavy locks on the the decision log plugin that force for example the encoding of decision in single file. When measuring raw performance of a fleet of OPAs (~50 instances at 30,000 rps) we measured a one order of magnitude higher p99 latency with vs without decision logs turned on.

Describe the ideal solution

If we change the trade-off to auditability vs latency guarantees, a lock-free ring buffer with a fixed size could be used as an alternative to the existing solution. This would limit the used memory in both cases.
In case auditability is favoured, offered chunks would be tried until it can be put in the buffer (this creates back pressure and increases latency). In case low latency is favoured, offered chunks that cannot be placed in the buffer can be discarded.
In both cases, this can be achieved without holding any locks.

Metadata

Metadata

Type

No type

Projects

Status

Done

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions