Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] BlockingBuffer.bufferUsage metric does not include records in-flight #3936

Closed
dlvenable opened this issue Jan 9, 2024 · 4 comments · Fixed by #3937
Closed

[BUG] BlockingBuffer.bufferUsage metric does not include records in-flight #3936

dlvenable opened this issue Jan 9, 2024 · 4 comments · Fixed by #3937
Assignees
Labels
bug Something isn't working
Milestone

Comments

@dlvenable
Copy link
Member

Describe the bug

The BlockingBuffer.bufferUsage metric is inaccurate. It indicates that it is the percentage of the buffer used. However, it is only the percentage of the buffer used for messages that are waiting. It does not include in-flight messages.

Expected behavior

I expect this metric to include both the records waiting and the records-in-flight when calculating the usage.

Screenshots

We sent these metrics to AWS CloudWatch. You can see in the screenshot below that the sum of recordsInBuffer and recordsInFlight (the line in blue) reaches near the maximum defined size of 1,000,000 records. However, the bufferUsage metric is around 70% at the time that happens.

BufferMetrics

Additional context

This can lead to confusion when trying to see why events are not writing to the buffer. The metrics indicate that there is capacity, but we fail to write to the buffer due to a timeout exception.

You can see that below.

final boolean permitAcquired = capacitySemaphore.tryAcquire(size, timeoutInMillis, TimeUnit.MILLISECONDS);
if (!permitAcquired) {
throw new TimeoutException(
format("Pipeline [%s] - Buffer does not have enough capacity left for the number of records: %d, " +
"timed out waiting for slots.",
pipelineName, size));

This code is what appears to be incorrect. It does not account for the records in-flight.

final Double nonNegativeTotalRecords = recordsInBuffer.doubleValue() < 0 ? 0 : recordsInBuffer.doubleValue();
final Double boundedTotalRecords = nonNegativeTotalRecords > bufferCapacity ? bufferCapacity : nonNegativeTotalRecords;
final Double usage = boundedTotalRecords / bufferCapacity * 100;
bufferUsage.set(usage);

@dlvenable dlvenable added bug Something isn't working untriaged labels Jan 9, 2024
@dlvenable dlvenable added this to the v2.6.2 milestone Jan 9, 2024
@dlvenable dlvenable self-assigned this Jan 9, 2024
@dlvenable
Copy link
Member Author

An alternative solution is to also update the recordsInBuffer metric to more accurately reflect that it is the totality of all records. Then we could have a recordsWaitingInBuffer.

Or we could add a new metric that expresses both: recordsTotalInBuffer. This would be equal to the sum of both recordsInBuffer and recordsInFlight.

@graytaylor0
Copy link
Member

I think there is still value in knowing the number of recordsInBuffer from a debugging standpoint

@kkondaka
Copy link
Collaborator

kkondaka commented Jan 9, 2024

As discussed offline, I think it is better to have a metric that directly corresponds to the semaphore count that controls the capacity

@dlvenable
Copy link
Member Author

I'm thinking of making two changes:

  1. Create a new gauge metric on the semaphore's availablePermits() to track the actual buffer capacity - BlockingBuffer.capacityUsed.
  2. Replace the logic for creating the existing bufferUsage such that it is effectively availablePermits() / bufferCapacity.

dlvenable added a commit to dlvenable/data-prepper that referenced this issue Jan 10, 2024
…between the bufferCapacity and the available permits in the semaphore. Adds a new capacityUsed metric which tracks the actual capacity used by the semaphore which blocks. Resolves opensearch-project#3936.

Signed-off-by: David Venable <[email protected]>
dlvenable added a commit that referenced this issue Jan 10, 2024
…between the bufferCapacity and the available permits in the semaphore. Adds a new capacityUsed metric which tracks the actual capacity used by the semaphore which blocks. Resolves #3936. (#3937)

Signed-off-by: David Venable <[email protected]>
@github-project-automation github-project-automation bot moved this from Unplanned to Done in Data Prepper Tracking Board Jan 10, 2024
opensearch-trigger-bot bot pushed a commit that referenced this issue Jan 10, 2024
…between the bufferCapacity and the available permits in the semaphore. Adds a new capacityUsed metric which tracks the actual capacity used by the semaphore which blocks. Resolves #3936. (#3937)

Signed-off-by: David Venable <[email protected]>
(cherry picked from commit d61b0c5)
dlvenable added a commit that referenced this issue Jan 10, 2024
…between the bufferCapacity and the available permits in the semaphore. Adds a new capacityUsed metric which tracks the actual capacity used by the semaphore which blocks. Resolves #3936. (#3937) (#3940)

Signed-off-by: David Venable <[email protected]>
(cherry picked from commit d61b0c5)

Co-authored-by: David Venable <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
Development

Successfully merging a pull request may close this issue.

3 participants