@NickGoog commented Aug 5, 2025

  • I added CHANGELOG entry for this change.
  • Change is not relevant to the end user.

Changes

Adds zonal buckets option for Google Cloud Storage.

Need to update the docs and the objstore version in the main repo.

Verification

Wrote tests and manually tested. See below (performance comparison at bottom!):

Resource setup (uses some internal Google commands that I'll demarcate)

# Customize these for your setup.
export PROJECT=
export REGULAR_BUCKET=
export RAPID_BUCKET=
export REGULAR_VM=
export RAPID_VM=
export FAKE_THANOS_DATA_PATH=
export SCRATCH_WORKSPACE=

# gcloud CLI needs to stick to Rapid-allowlisted project.
gcloud config set project $PROJECT

# Bucket creation.
gcloud storage buckets create gs://$REGULAR_BUCKET
# {Internal command for creating a Rapid bucket. Speak to rep.}

# Generate a year of Thanos metrics data.
sudo apt-get install -y git make
git clone https://github.com/thanos-io/thanosbench.git
cd thanosbench
make build
./thanosbench block plan -p continuous-365d-tiny --max-time=6h | ./thanosbench block gen --output.dir $FAKE_THANOS_DATA_PATH --workers 20

# Copy test data to buckets.
gcloud storage cp -r $FAKE_THANOS_DATA_PATH/* gs://$REGULAR_BUCKET
# {Internal command for uploading to Rapid bucket. Speak to rep.}

# VM creation.
gcloud compute instances create $REGULAR_VM --zone=us-central1-b --subnet=default --machine-type=c4d-standard-16 --project=$PROJECT
gcloud compute instances create $RAPID_VM --zone=us-central1-b --subnet=default --machine-type=c4d-standard-16 --project=$PROJECT

Regular VM login

export PROJECT=
export REGULAR_VM=

# From here on, we'll be inside the VM.
gcloud compute ssh $REGULAR_VM --zone "us-central1-b" --project $PROJECT

Regular VM setup

# Refresh env vars and add more.
export BUCKET=
export OBJSTORECFG_FILE=$PWD/regular-objstore-config.yml
export PROMETHEUS_EXECUTABLE=$PWD/prometheus-3.5.0.linux-amd64/prometheus
export THANOS_EXECUTABLE=thanos
export GCS_BUCKET=value-does-not-matter
export REMOTE_WRITE_ENABLED=do-it

# Establish credentials. Expect interactive browser steps.
gcloud auth login
gcloud auth application-default login

# Thanos config.
cat <<EOF >$OBJSTORECFG_FILE
type: GCS
config:
  bucket: "$BUCKET"
EOF

# Get Prometheus binary (Thanos is built on Prometheus).
wget https://github.com/prometheus/prometheus/releases/download/v3.5.0/prometheus-3.5.0.linux-amd64.tar.gz
tar -xzvf prometheus-3.5.0.linux-amd64.tar.gz

# Install Go.
wget https://dl.google.com/go/go1.22.3.linux-amd64.tar.gz
tar -xvf go1.22.3.linux-amd64.tar.gz
sudo mv go /usr/local
export GOROOT=/usr/local/go
export GOPATH=$HOME/go
export PATH=$GOPATH/bin:$GOROOT/bin:$PATH

# Build latest version of Thanos from source.
sudo apt-get install -y git make
git clone https://github.com/thanos-io/thanos.git
cd thanos
make build

# Start Thanos servers.
scripts/quickstart.sh

Open an SSH tunnel to view Prometheus in a browser at localhost:9090. Run this in a new terminal tab.

export PROJECT=
export REGULAR_VM=

gcloud compute ssh $REGULAR_VM --zone "us-central1-b" --project $PROJECT --ssh-flag="-L 9090:localhost:9090"

Zonal bucket VM login

export PROJECT=
export RAPID_VM=

# From here on, we'll be inside the VM.
gcloud compute ssh $RAPID_VM --zone "us-central1-b" --project $PROJECT

Zonal bucket VM setup

# Refresh env vars and add more.
export BUCKET=
export OBJSTORECFG_FILE=$PWD/grpc-objstore-config.yml
export PROMETHEUS_EXECUTABLE=$PWD/prometheus-3.5.0.linux-amd64/prometheus
export THANOS_EXECUTABLE=thanos
export GCS_BUCKET=value-does-not-matter
export REMOTE_WRITE_ENABLED=do-it

# Establish credentials. Expect interactive browser steps.
gcloud auth login
gcloud auth application-default login

# Thanos config.
cat <<EOF >$OBJSTORECFG_FILE
type: GCS
config:
  bucket: "$BUCKET"
  use_grpc: true
  use_zonal_buckets: true
EOF

# Get Prometheus binary (Thanos is built on Prometheus).
wget https://github.com/prometheus/prometheus/releases/download/v3.5.0/prometheus-3.5.0.linux-amd64.tar.gz
tar -xzvf prometheus-3.5.0.linux-amd64.tar.gz

# Install Go.
wget https://dl.google.com/go/go1.22.3.linux-amd64.tar.gz
tar -xvf go1.22.3.linux-amd64.tar.gz
sudo mv go /usr/local
export GOROOT=/usr/local/go
export GOPATH=$HOME/go
export PATH=$GOPATH/bin:$GOROOT/bin:$PATH

# Prepare objstore dependency for Thanos that adds zonal buckets support.
sudo apt-get install -y git make
git clone https://github.com/NickGoog/objstore.git
cd objstore
git checkout zonalBucketsUpdate
cd ..

# Prepare latest version of Thanos from source.
git clone https://github.com/thanos-io/thanos.git
cd thanos
# Add custom objstore dependency with zonal buckets update.
cat <<EOF >>go.mod

replace (
        github.com/thanos-io/objstore => ../objstore
)
EOF
# Finish building Thanos.
make build

# Start Thanos servers.
scripts/quickstart.sh

Open an SSH tunnel to view Prometheus in a browser at localhost:9091 (local port 9091 forwards to the VM's port 9090). Run this in a new terminal tab.

export PROJECT=
export RAPID_VM=

gcloud compute ssh $RAPID_VM --zone "us-central1-b" --project $PROJECT --ssh-flag="-L 9091:localhost:9090"

Cleanup

export REGULAR_BUCKET=
export RAPID_BUCKET=
export REGULAR_VM=
export RAPID_VM=

# Bucket contents (remove asterisk to delete buckets as well).
gcloud storage rm -r gs://$REGULAR_BUCKET/*
gcloud storage rm -r gs://$RAPID_BUCKET/*

# Stop VM (replace "stop" with "delete" to remove VMs).
gcloud compute instances stop $REGULAR_VM --zone=us-central1-b
gcloud compute instances stop $RAPID_VM --zone=us-central1-b

Taking measurements quickly after startup…

Regular

  • thanos_bucket_store_block_load_duration_seconds_sum
    • 9.53
    • sum(thanos_bucket_store_block_load_duration_seconds_sum) / sum(thanos_bucket_store_block_load_duration_seconds_count)
      • 0.73
    • histogram_quantile(0.5, thanos_bucket_store_block_load_duration_seconds_bucket)
      • 0.75
  • thanos_bucket_store_indexheader_download_duration_seconds_sum
    • 8.94
    • sum(thanos_bucket_store_indexheader_download_duration_seconds_sum) / sum(thanos_bucket_store_indexheader_download_duration_seconds_count)
      • 0.69
    • histogram_quantile(0.5, thanos_bucket_store_indexheader_download_duration_seconds_bucket)
      • 0.75
        • Yes, identical to the above metric. Strange.
  • thanos_blocks_meta_sync_duration_seconds_sum
    • 1.05
    • sum(thanos_blocks_meta_sync_duration_seconds_sum) / sum(thanos_blocks_meta_sync_duration_seconds_count)
      • 0.52
    • histogram_quantile(0.5, thanos_blocks_meta_sync_duration_seconds_bucket)
      • 0.51

Rapid

  • thanos_bucket_store_block_load_duration_seconds_sum
    • 4.06
    • sum(thanos_bucket_store_block_load_duration_seconds_sum) / sum(thanos_bucket_store_block_load_duration_seconds_count)
      • 0.31
    • histogram_quantile(0.5, thanos_bucket_store_block_load_duration_seconds_bucket)
      • 0.35
  • thanos_bucket_store_indexheader_download_duration_seconds_sum
    • 3.44
    • sum(thanos_bucket_store_indexheader_download_duration_seconds_sum) / sum(thanos_bucket_store_indexheader_download_duration_seconds_count)
      • 0.26
    • histogram_quantile(0.5, thanos_bucket_store_indexheader_download_duration_seconds_bucket)
      • 0.35
  • thanos_blocks_meta_sync_duration_seconds_sum
    • 1.27
    • sum(thanos_blocks_meta_sync_duration_seconds_sum) / sum(thanos_blocks_meta_sync_duration_seconds_count)
      • 0.64
    • histogram_quantile(0.5, thanos_blocks_meta_sync_duration_seconds_bucket)
      • 1.00
| Metric | Sum (Default) | Sum (Zonal) | Average (Default) | Average (Zonal) | Median (Default) | Median (Zonal) |
| --- | --- | --- | --- | --- | --- | --- |
| thanos_bucket_store_block_load_duration_seconds | 9.53 | 4.06 | 0.73 | 0.31 | 0.75 | 0.35 |
| thanos_bucket_store_indexheader_download_duration_seconds | 8.94 | 3.44 | 0.69 | 0.26 | 0.75 | 0.35 |
| thanos_blocks_meta_sync_duration_seconds | 1.05 | 1.27 | 0.52 | 0.64 | 0.51 | 1.00 |
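As a quick sanity check on the averages above, the relative change per metric can be computed directly from the table values (illustrative helper only, not part of the PR; negative numbers mean the zonal bucket was faster):

```shell
# Relative change in average latency per metric (zonal vs default bucket),
# using the average columns from the table above.
awk 'BEGIN {
  printf "block_load: %+.1f%%\n", (0.31 - 0.73) / 0.73 * 100
  printf "indexheader_download: %+.1f%%\n", (0.26 - 0.69) / 0.69 * 100
  printf "meta_sync: %+.1f%%\n", (0.64 - 0.52) / 0.52 * 100
}'
# block_load: -57.5%
# indexheader_download: -62.3%
# meta_sync: +23.1%
```

So the two download-heavy metrics improved by roughly 58% and 62%, while meta sync regressed by about 23%.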

The above metrics cover downloading the small index headers for each of the thirteen seed blocks. I also added some debug logs to the compact command to confirm the download is running and print the latency.


VM setup tweak:

# Prepare objstore dependency for Thanos that adds zonal buckets support and logs.
sudo apt-get install -y git make
git clone https://github.com/NickGoog/objstore.git
cd objstore
# ***** This is the main difference. *****
git checkout WithLogs
cd ..

# Prepare latest version of Thanos from source.
git clone https://github.com/thanos-io/thanos.git
cd thanos
# Add custom objstore dependency with zonal buckets update.
cat <<EOF >>go.mod

replace github.com/thanos-io/objstore => ../objstore
EOF

Run these commands in a separate terminal SSH’d into the VM:


export OBJSTORECFG_FILE=
export GOROOT=/usr/local/go
export GOPATH=$HOME/go
export PATH=$GOPATH/bin:$GOROOT/bin:$PATH

thanos compact --data-dir=./data --objstore.config-file=$OBJSTORECFG_FILE

ts=2025-08-07T15:52:13.281472855Z caller=factory.go:39 level=info msg="loading bucket configuration"
...
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!DownloadFile01K1X77T5JNMME3JSMNRGNZZBN/chunks/000001=(MISSING)
ts=2025-08-07T15:52:14.775661453Z caller=objstore.go:482 !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!@/u簊fA['zɘu~֤/ufA['zɘuGL7ufA['z
...
4P/u٫fAyu2MfAyu=(MISSING)
ts=2025-08-07T15:52:14.776513662Z caller=objstore.go:482 !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!DownloadFileEND01K1X77T5JNMME3JSMNRGNZZBN/chunks/000001103.304407ms=(MISSING)
ts=2025-08-07T15:52:14.776534934Z caller=downsample.go:366 level=info msg="downloaded block" id=01K1X77T5JNMME3JSMNRGNZZBN duration=527.930579ms duration_ms=527
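Because the log writer strips spaces, the file name and the download duration fuse into one token in the logs above (e.g. "000001103.304407ms"). Since the chunk file name is known, the two fields can be split back apart; a small illustrative shell helper (not part of the PR):

```shell
# Split the fused "<file name><duration>" token from the debug log above.
# The chunk file name ("000001") is known, so everything after it is the
# download duration.
fused="000001103.304407ms"
file_name="000001"
duration="${fused#"$file_name"}"
echo "file=$file_name duration=$duration"
# file=000001 duration=103.304407ms
```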

Walkthrough:

  1. Set the ObjStore config (either for regular or Rapid).
  2. Pull Go into PATH so we can use thanos commands.
  3. Run compact command that downloads blocks (our command errors after the first block because it’s fake test data, but the one download is enough to illustrate performance).
  4. Skip irrelevant logs.
  5. See the log I added that prints the binary contents of the downloaded block file to confirm the download actually happened. Sadly, the readability spaces I added were automatically removed.
  6. See the download time log I added, “000001103.304407ms”. The 1.2 KB block file is named “000001”. Sadly, because spaces were removed, this blends with the download time. However, separating these, we get a time of 103.304407ms.
  7. I also included the built-in duration log here: 527.930579ms.
    Rerunning in the regular and Rapid VM three times, I obtained:

Regular

  • 94.741372ms
  • 99.844454ms
  • 96.193848ms

Rapid

  • 34.926639ms
  • 37.741793ms
  • 30.469198ms
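Averaging the three runs from each VM (simple arithmetic on the numbers above, shown here as a sketch) gives roughly 96.9 ms for the regular bucket versus 34.4 ms for the Rapid bucket, about a 65% reduction:

```shell
# Average the three compact-download latencies (ms) per VM and compute the
# relative reduction, using the run results listed above.
awk 'BEGIN {
  reg = (94.741372 + 99.844454 + 96.193848) / 3
  rap = (34.926639 + 37.741793 + 30.469198) / 3
  printf "regular avg: %.1f ms\n", reg
  printf "rapid avg:   %.1f ms\n", rap
  printf "reduction:   %.1f%%\n", (reg - rap) / reg * 100
}'
# regular avg: 96.9 ms
# rapid avg:   34.4 ms
# reduction:   64.5%
```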

Known issues

@NickGoog force-pushed the zonalBucketsUpdate branch from 46790a7 to b8ab9d8 on August 5, 2025 19:24
@NickGoog marked this pull request as draft on August 5, 2025 19:26
@NickGoog force-pushed the zonalBucketsUpdate branch 2 times, most recently from 7d6b044 to c8ed35e on August 5, 2025 19:31
@NickGoog marked this pull request as ready for review on August 5, 2025 19:32
@bwplotka requested a review from fpetkovski on August 7, 2025 09:12
@bwplotka (Member) left a comment:
Epic, generally LGTM, thanks for amazing repro scripts on benchmarks 💪🏽

I would change commentary slightly with more details and actually add item to changelog.

> Avoiding updating CHANGELOG until after main repo update.

Why? We have changelog in this repo too, there are at least 2 more big users of this library. (:

Comment on lines +46 to +48
// Both upload and download paths will use zonal bucket gRPC APIs by default.
// Zonal buckets are currently allowlisted; please contact your Google account
// manager if interested.
Suggested change
// Both upload and download paths will use zonal bucket gRPC APIs by default.
// Zonal buckets are currently allowlisted; please contact your Google account
// manager if interested.
// UseZonalBuckets enables Zonal gRPC APIs to be used to access (both writes and reads) GCS storage. You can read more about zonal buckets <here>
//
// Initial benchmarks shown avg. ~50+% reduction of latency for larger Thanos store operations (see: https://github.com/thanos-io/objstore/pull/209#issue-3294127783).

Member:
I would leave the following info in some public doc and link it from here. Otherwise this comment might age very fast.

Zonal buckets are currently in Google Private Preview; please contact your Google account
manager if interested.

Member:

As written in suggestion, I would also ignore UseGRPC if zonal is enabled - might be easier for user - not a strong opinion though.

Member:

Also - did I remember right that zonal buckets don't have redundancy? Should we mention this detail in the comment?

Member:

+1 to all the above. Let's maybe add links to docs for this, so people know what zonal buckets are.

Also, maybe we add some NOTE comment on the field with the consequence of using this (eg higher pricing, redundancy?)

Author (@NickGoog):

> I would leave the following info in some public doc and link it from here
>
> +1 to all the above. Let's maybe add links to docs for this

I agree. I'd normally do this, but we haven't launched yet, so there are no public docs. Should I put the PR on ice until then? (Not allowed to say when launch is, sadly, but you can probably tell we’re quite far along.)

> As written in suggestion, I would also ignore UseGRPC if zonal is enabled

Keeping this toggle is useful for future non-gRPC support.

> Also - did I remember right that zonal buckets don't have redundancy?

We don’t support Soft Delete, which allows you to restore deleted files within certain constraints: https://cloud.google.com/storage/docs/soft-delete

The other redundancy features that come to mind are dual-region and multi-region buckets: https://cloud.google.com/storage/docs/locations#considerations (these docs also mention one-region buckets have failover between zones within a region)

In contrast to these settings, zonal buckets are tied to a single zone, geographically smaller than a region. Part of the performance comes from the data being closer to where the user needs it. I agree that linking docs would be the best way to explain this, but I’ll leave a TODO for now.

@saswatamcode (Member) left a comment:

Thanks!

This looks good already, minus the doc comments.
Might also be worth committing the benchmark scripts somehow?

@NickGoog commented Aug 7, 2025

Thanks for the review! Switching to draft mode while I investigate better demonstration of performance improvement.

@NickGoog marked this pull request as draft on August 7, 2025 16:33
@NickGoog force-pushed the zonalBucketsUpdate branch from c8ed35e to 4603b4f on August 8, 2025 16:23
@NickGoog commented Aug 8, 2025

I've updated the PR description with more testing!

> Why? We have changelog in this repo too

Done. Updated the changelog.

> Might also be worth committing the benchmark scripts somehow?

I was thinking about this as well, but a number of concerns held me back:

  • We don’t have a public CLI for creating zonal buckets and copying to them yet.
    • Solution: Don’t include zonal buckets in the script or wait until the tools are available.
  • Although I fixed a number of bugs in the quickstart script (e.g. thanos#8400, which updated the thanos query flag --store to --endpoint), some still exist. For example, the script claims to run three receive components, but only one seems to work.
    • Solution: Fix all the things—I tried this and unfortunately couldn’t get it done in a reasonable time frame. Maybe someone else can? We could also just add a caveat to the docs that one receive component is enough to demonstrate the improved download time.
  • Cost implications of running VMs and zonal buckets.
    • Solution: Tell people to be aware of costs in the docs. Might want to exclude zonal buckets from the script for this reason too because the cost isn’t public yet.
  • My custom Git branches will eventually go away.
    • Solution: Merge to main and remove the custom Git branch steps from the script.
  • Prometheus version is hardcoded.
    • Solution: Build from source.
  • Script doesn’t actually measure anything by itself (it creates the environment for quickstart.sh to work with GCS).
    • Solution: Automate running thanos compact and measuring download speed in the script.
  • Script currently generates otel-plugin log spam (originating from the Go gRPC package).
    • Solution: I spent a few hours on this but couldn’t solve. Our Go experts are looking into it, though. I’ve added a “Known issues” section to the PR description.

The two most appealing solutions to me are:

@NickGoog marked this pull request as ready for review on August 8, 2025 16:29