Skip to content

Conversation

@evanh
Copy link
Member

@evanh evanh commented May 15, 2025

This change creates a new SQLite DB that stores the TaskActivation protobuf, keyed by ID. This table
is used to hydrate the protobuf when it is needed in the taskbroker. The original table remains,
however it won't contain the protobuf data. In order to ensure that processing could happen without
needing the protobuf itself, the on_attempts_exceeded value needed to be pulled out of the protobuf
and stored in the table directly.

There are now extra functions for fetching just the activation data, get_activation_blob and
get_activation_raw_blob. These fetch the decoded and encoded versions of the protobuf
respectively.

There is also a new upkeep function that deletes all the blobs that don't have a corresponding ID in
the meta table. Upkeep works essentially unchanged for the original table. Once a row is deleted
from the meta table, the corresponding blob is considered orphaned. The upkeep function deletes all
such blobs from the table that are older than the given delay_ms.

All of this is gated behind a config option that determines whether just the old table is used or
the new table is used. There are four modes for this migration:

  1. Write only to the original table, and read only from the original table (same as before)
  2. Write to both tables, but read from the original table.
  3. Write to both tables, and read from the new table.
  4. Write only to the new table and read from the new table.

There will be separate PRs moving the code through the different states, which allows for easy
rollbacks in case of problems. Since the new column added to the original table is nullable, this PR
can be rolled back safely as well.

/// Remove orphaned blobs that don't have a corresponding ID in the meta table.
/// This method is a garbage collector for the blob store.
#[instrument(skip_all)]
pub async fn remove_orphaned_blobs(&self) -> Result<u64, Error> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could call this from the maintenance thread.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to add this to the main upkeep for now. We can change that if it's a problem.

@evanh evanh force-pushed the evanh/feat/separate-activations-into-another-table branch from 27a419f to 4fa2eca Compare May 16, 2025 19:06
@evanh evanh changed the title evanh/feat/separate activations into another table feat(sqlite): Split activation blobs to a separate DB May 16, 2025
@codecov
Copy link

codecov bot commented May 16, 2025

Codecov Report

Attention: Patch coverage is 93.65672% with 34 lines in your changes missing coverage. Please review.

Project coverage is 87.71%. Comparing base (a097bae) to head (fc2ce64).
Report is 4 commits behind head on main.

Files with missing lines Patch % Lines
src/store/inflight_activation.rs 92.52% 13 Missing ⚠️
src/grpc/server.rs 72.72% 9 Missing ⚠️
src/upkeep.rs 94.21% 7 Missing ⚠️
src/config.rs 69.23% 4 Missing ⚠️
src/main.rs 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #369      +/-   ##
==========================================
+ Coverage   87.53%   87.71%   +0.18%     
==========================================
  Files          20       20              
  Lines        4628     4926     +298     
==========================================
+ Hits         4051     4321     +270     
- Misses        577      605      +28     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@evanh evanh force-pushed the evanh/feat/separate-activations-into-another-table branch 2 times, most recently from 109ed15 to 56b1825 Compare May 16, 2025 19:13
This change creates a new SQLite DB that stores the TaskActivation protobuf, keyed by ID. This table
is used to hydrate the protobuf when it is needed in the taskbroker. The original table remains,
however it won't contain the protobuf data. In order to ensure that processing could happen without
needing the protobuf itself, the on_attempts_exceeded value needed to be pulled out of the protobuf
and stored in the table directly.

There are now extra functions for fetching just the activation data, `get_activation_blob` and
`get_activation_raw_blob`. These fetch the decoded and encoded versions of the protobuf
respectively.

There is also a new upkeep function that deletes all the blobs that don't have a corresponding ID in
the meta table. Upkeep works essentially unchanged for the original table. Once a row is deleted
from the meta table, the corresponding blob is considered orphaned. The upkeep function deletes all
such blobs from the table that are older than the given delay_ms.

All of this is gated behind a config option that determines whether just the old table is used or
the new table is used. There are four modes for this migration:
1. Write only to the original table, and read only from the original table (same as before)
2. Write to both tables, but read from the original table.
3. Write to both tables, and read from the new table.
4. Write only to the new table and read from the new table.

There will be separate PRs moving the code through the different states, which allows for easy
rollbacks in case of problems. Since the new column added to the original table is nullable, this PR
can be rolled back safely as well.
@evanh evanh force-pushed the evanh/feat/separate-activations-into-another-table branch from 56b1825 to 88ae998 Compare May 16, 2025 19:16
@evanh evanh marked this pull request as ready for review May 16, 2025 19:16
@evanh evanh requested a review from a team as a code owner May 16, 2025 19:16
Comment on lines +704 to +707
sqlx::query("DELETE FROM taskactivation_blobs WHERE id = $1")
.bind(id)
.execute(&self.blob_write_pool)
.await?;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this need an if self.is_new_write() guard?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't, the table will exist regardless of the guards. The guards just determine what is in the table.

Copy link
Contributor

@enochtangg enochtangg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice stuff! This is quite a fundamental change so we should do thorough testing in the sandbox before we consider pushing to production


match inflight {
Ok(Some(inflight)) => {
let result = self.store.get_activation_blob(&inflight.id).await;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's nice that the get_activation_blob path is gated by BlobTableMode, but doing this will double our reads during the transitional phase. If so, I'd imagine we would see throughput hits in our testing.

.await;
let activation = match result {
Ok(activation) => activation,
Err(e) => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same thing here as GetTask?

b.push_bind(row.activation);
b.push_bind(row.added_at.timestamp());
})
.push(" ON CONFLICT(id) DO NOTHING")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a note on ON CONFLICT(id) DO NOTHING since I think it has an interesting implication down the road. We know that it's possible for writes to the blob db to succeed and writes to meta db to fail. If a batch fails and is retried later, the blob will not be updated since the ID already exists. This seems ok since the blob doesn't store any information that gets altered by inflight_activation_writer when it gets retried.

@evanh
Copy link
Member Author

evanh commented May 23, 2025

Closing this as it isn't showing any improvements in the sandbox. Could be revisited later.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants