Conversation

@kaikaila (Contributor) commented Oct 19, 2025

Summary

This PR adds full PostgreSQL (pgx driver) support to Kubeflow Pipelines backend, enabling users to choose between MySQL and PostgreSQL as the metadata database. The implementation introduces a clean dialect abstraction layer and includes a major query optimization that benefits both database backends.

Key achievements
✅ Complete PostgreSQL integration for API Server and Cache Server, addressing #7512, #9813
✅ All CI tests passing (MySQL + PostgreSQL).
✅ Significant performance improvement for ListRuns queries. This PR is expected to address the root causes behind #10778, #10230, #9780, #9701
✅ Zero breaking changes - backward compatible with existing MySQL deployments

What Changed

  1. Storage Layer Refactoring - Dialect Abstraction (backend/src/apiserver/common/sql/dialect)
  • Problem
    SQL syntax was tightly coupled to MySQL.

  • Solution
    Introduced a DBDialect interface that encapsulates database-specific behavior:
    • Identifier quoting (MySQL backticks vs PostgreSQL double quotes)
    • Placeholder styles (? vs $1, $2, ...)
    • Aggregation functions (GROUP_CONCAT vs string_agg)
    • Concatenation syntax (CONCAT() vs ||)

  • Files

    • Core dialect implementation → backend/src/apiserver/common/sql/dialect/dialect.go
    • Dialect-aware utility functions → backend/src/apiserver/storage/sql_dialect_util.go
    • Reusable filter builders with proper quoting → backend/src/apiserver/storage/list_filters.go

All storage layer code now uses:

q := s.dbDialect.QuoteIdentifier
qb := s.dbDialect.QueryBuilder()

This ensures queries work correctly across MySQL, PostgreSQL, and SQLite (for tests).
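As a rough sketch of what such a dialect interface can look like (the type name DBDialect is from the PR; the method set and the two concrete implementations below are illustrative, not the PR's exact code):

```go
package main

import "fmt"

// DBDialect is an illustrative sketch of the abstraction described above;
// the real interface in backend/src/apiserver/common/sql/dialect may differ.
type DBDialect interface {
	QuoteIdentifier(name string) string
	GroupConcat(expr, sep string) string
}

type mysqlDialect struct{}

// MySQL quotes identifiers with backticks and aggregates with GROUP_CONCAT.
func (mysqlDialect) QuoteIdentifier(name string) string { return "`" + name + "`" }
func (mysqlDialect) GroupConcat(expr, sep string) string {
	return fmt.Sprintf("GROUP_CONCAT(%s SEPARATOR %q)", expr, sep)
}

type postgresDialect struct{}

// PostgreSQL quotes identifiers with double quotes and uses string_agg.
func (postgresDialect) QuoteIdentifier(name string) string { return "\"" + name + "\"" }
func (postgresDialect) GroupConcat(expr, sep string) string {
	return fmt.Sprintf("string_agg(%s, '%s')", expr, sep)
}

func main() {
	for _, d := range []DBDialect{mysqlDialect{}, postgresDialect{}} {
		fmt.Println(d.QuoteIdentifier("Namespace"), d.GroupConcat("metrics.Payload", ","))
	}
}
```

Call sites then pick the dialect once at startup and build every query through it, which is what makes the same storage code portable across backends.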

  2. ListRuns Query Performance Optimization
  • Problem
    The original ListRuns query called addMetricsResourceReferencesAndTasks, which performed a 3-layer LEFT JOIN with a GROUP BY on all columns, including LONGTEXT fields such as PipelineSpecManifest and WorkflowSpecManifest. This caused slow response times for large datasets.
  • Solution
    Layers 1-3: LEFT JOIN only on PrimaryKeyUUID + aggregated columns (refs, tasks, metrics)
    Final layer: INNER JOIN back to run_details to fetch LONGTEXT columns
  • Performance impact
    Eliminates GROUP BY on LONGTEXT columns entirely. Expected substantial performance improvements for deployments with large pipeline specifications, though formal load testing has not yet been conducted.
  3. Deployment Configurations
  • Production-ready PostgreSQL kustomization → manifests/kustomize/env/platform-agnostic-postgresql/
  • Local development setup → manifests/kustomize/env/dev-kind-postgresql/
  • PostgreSQL StatefulSet → manifests/kustomize/third-party/postgresql/

Configuration is symmetric to existing MySQL manifests for consistency.

  4. CI Manifest Overlays

Created CI-specific Kustomize overlays to ensure tests use locally built images from the Kind registry instead of pulling official images from ghcr.io:

  • Added PostgreSQL CI overlay → .github/resources/manifests/standalone/postgresql/
  • Added kfp-cache-server image override to .github/resources/manifests/standalone/base/kustomization.yaml
  5. Added two PostgreSQL-specific CI workflows
  • V2 API and integration tests (cache enabled/disabled matrix) → api-server-test-Postgres.yml
  • V1 integration tests → integration-tests-v1-postgres.yml

PostgreSQL tests cover the core cache enabled/disabled matrix.

  6. Local development support
  • make dev-kind-cluster-pg - Provision Kind cluster with PostgreSQL
  • Updated README for PostgreSQL setup and debugging, achieving parity with MySQL documentation.

Testing

Unit Tests

23 test files modified/added.
New test coverage: dialect_test.go, list_filters_test.go, sql_dialect_util_test.go.
All existing tests updated to use the dialect abstraction.

Integration Tests

✅ V1 API integration tests (PostgreSQL)
✅ V2 API integration tests (PostgreSQL, cache enabled/disabled)
✅ Existing MySQL tests remain green

Migration Guide

  • For new deployments:
    kubectl apply -k manifests/kustomize/env/platform-agnostic-postgresql
  • For existing MySQL deployments:
    No action required. This PR is fully backward compatible.
  • For local development, to set up the kind cluster with Postgres
    make -C backend dev-kind-cluster-pg

This PR continues from #12063.

@google-oss-prow

Hi @kaikaila. Thanks for your PR.

I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@github-actions

🚫 This command cannot be processed. Only organization members or owners can use the commands.

@kaikaila kaikaila force-pushed the feature/postgres-integration branch 7 times, most recently from cd1d08b to 85498ed Compare October 22, 2025 05:03
@kaikaila (Contributor Author)

Currently, both MySQL and PGX setups use the DB superuser for all KFP operations, which is why client_manager.go contains a “create database if not exist” step here.

From a security standpoint, would it be preferable to:

  1. Move DB creation out of the client manager and into the deployment/init phase (i.e. add a manifests/kustomize/third-party/postgresql/base/pg-init-configmap.yaml) and
  2. Introduce a dedicated restricted user for KFP components, limited to the mlpipeline database?

If the team agrees, I can propose a follow-up PR to refactor accordingly.

@HumairAK (Collaborator)

I'm fine with this, I don't think it's great that KFP tries to create a database (or a bucket frankly)

fyi @mprahl / @droctothorpe

@kaikaila (Contributor Author)

Thanks, @HumairAK — totally agree on the security point.
Since this PR is already getting quite heavy, would you be okay if I leave the user permission changes for a separate follow-up PR?

@kaikaila kaikaila force-pushed the feature/postgres-integration branch 3 times, most recently from 09fd370 to 1e0caa8 Compare October 23, 2025 07:10
@HumairAK (Collaborator)

yes that is fine

@kaikaila kaikaila force-pushed the feature/postgres-integration branch 6 times, most recently from 4d33821 to e6c943c Compare October 24, 2025 02:47
@kaikaila (Contributor Author)

Question about the PostgreSQL test workflow organization

Current situation

The V2 integration tests for PostgreSQL logically belong in a "PostgreSQL counterpart" to legacy-v2-api-integration-tests.yml.
However, I didn't want to create a new workflow with "legacy" in the name from day one.
As a temporary solution, I merged them into api-server-test-Postgres.yml.
This causes asymmetry with api-server-tests.yml, and the workflow has mixed responsibilities.

Question: What's the recommended workflow organization for PostgreSQL tests?

Should I:

  • a. Create legacy-v2-api-integration-tests-postgres.yml for consistency (even though it's new)?
  • b. Keep current structure and accept the asymmetry?
  • c. Refactor both MySQL and PostgreSQL to a unified structure?

Would love guidance on the long-term vision for test workflow organization, especially from @nsingla

@kaikaila kaikaila force-pushed the feature/postgres-integration branch 5 times, most recently from 0948a18 to 591d9d6 Compare October 24, 2025 16:32
@kaikaila kaikaila force-pushed the feature/postgres-integration branch from 591d9d6 to 5dd2a8a Compare October 24, 2025 19:19
@kaikaila kaikaila changed the title [WIP]Feature/postgres integration feat(backend): postgres integration Oct 25, 2025
@kaikaila kaikaila force-pushed the feature/postgres-integration branch from 5dd2a8a to 7edde91 Compare October 26, 2025 20:22
@@ -0,0 +1,171 @@
// Copyright 2025 The Kubeflow Authors


Just a thought: Postgres uses $1, $2, ... while MySQL uses ?. When query fragments are built separately and concatenated, placeholder indices can end up wrong, causing runtime errors or misbound parameters.

Suggestion: add unit tests that build complex queries from fragments (including empty optional fragments and IN-lists) and assert both the final SQL and the parameter slice order for both dialects.
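The failure mode described here can be reproduced with a small stdlib-only sketch. In the real code, Squirrel's Dollar placeholder format does the rebinding; rebind below is a hypothetical stand-in that shows why numbering must happen on the final assembled SQL, never per fragment:

```go
package main

import (
	"fmt"
	"strings"
)

// rebind converts MySQL-style `?` placeholders to PostgreSQL `$1, $2, ...`.
// Note: this naive version does not skip `?` inside string literals; it is
// only meant to illustrate placeholder numbering.
func rebind(sql string) string {
	var b strings.Builder
	n := 0
	for _, r := range sql {
		if r == '?' {
			n++
			fmt.Fprintf(&b, "$%d", n)
		} else {
			b.WriteRune(r)
		}
	}
	return b.String()
}

func main() {
	// Fragments built independently, then concatenated. If each fragment were
	// rebound on its own, both would start at $1 and the binds would clash;
	// rebinding the final string keeps indices and the parameter slice aligned.
	where := "WHERE Namespace = ?"
	in := "AND UUID IN (?, ?)"
	final := rebind("SELECT * FROM runs " + where + " " + in)
	fmt.Println(final)
	// → SELECT * FROM runs WHERE Namespace = $1 AND UUID IN ($2, $3)
}
```

A unit test along these lines, run against both dialects, would catch off-by-one placeholder bugs before they surface as misbound parameters at runtime.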

@kaikaila kaikaila force-pushed the feature/postgres-integration branch from 5a4cefc to f68e26c Compare October 28, 2025 00:30
@nickalexander-disco

Just a note: I have been working on similar functionality for Postgres support. I modified the code so it no longer uses Squirrel and the SQL generator but uses native GORM throughout. This allows CamelCase support in table names and, in many cases, has resulted in significantly simpler queries on the DB. By maintaining CamelCase, the scenarios where reflection is used to generate predicates work as-is. All test cases are passing, and it would be great if we could look at potentially merging this with the work that has been done here. Using native GORM should also make it easier to add support for other databases if required. The branch is currently in line with master.

@kaikaila kaikaila force-pushed the feature/postgres-integration branch from adbfbdb to 1a21b29 Compare October 28, 2025 18:05
@kaikaila (Contributor Author) commented Oct 28, 2025

Hi @nickalexander-disco,

Thanks for your comment.

There are two main reasons why I decided not to use GORM for the store package:

  • GORM excels at simple CRUD operations but not at complex queries.
    The store layer in KFP is dominated by complex query patterns — dynamic SQL generation, multi-table JOINs, subqueries, aggregations, and conditional filters. These cannot be expressed cleanly in GORM without dropping back to raw SQL, which defeats the purpose of using an ORM.

  • GORM behaves as a black box, making debugging difficult.
    In the store layer, we often need full visibility and control over the generated SQL to debug query behavior across PostgreSQL, MySQL, and SQLite. GORM’s abstraction layer hides too much detail, and we can’t reliably reproduce or inspect SQL behavior in CI or integration tests.

Because complex queries are the dominant pattern here, I prioritized consistency of implementation — instead of mixing approaches, all queries in the store package are written using Squirrel, a transparent SQL builder that allows precise control, easy unit testing, and clear debugging.

That said, I'm open to evaluating where GORM provides advantages in practice. You mentioned that you have been working on similar functionality using native GORM and that it passes all tests. Could you please share:

  • a concrete branch or PR with your code changes,
  • CI test results showing that all tests pass (feel free to reuse the CI workflow YAMLs in this PR).

I’d be happy to review them and consider merging our efforts.

@kaikaila kaikaila requested a review from sanchesoon October 28, 2025 19:15
…iveExperiment

- Extract repeated subquery SQL into resourceReferenceSubquery variable
- Unify code style: consistently use SetMap() throughout
- Add detailed comments explaining PostgreSQL $N placeholder handling
- Simplify error messages

optimization according to sanchesoon's suggestion

Signed-off-by: kaikaila <[email protected]>
@kaikaila kaikaila force-pushed the feature/postgres-integration branch from 1a21b29 to 2362839 Compare October 30, 2025 05:27
image_registry: ${{ needs.build.outputs.IMAGE_REGISTRY }}
forward_port: "true"
- name: Port-forward Postgres
run: kubectl -n "$POSTGRES_NAMESPACE" port-forward svc/postgres-service ${{ env.DB_PORT }}:${{ env.DB_PORT }} --address=${{ env.DB_FORWARD_IP }} &
Collaborator:

curious why you had to do -address=${{ env.DB_FORWARD_IP }}, by default it is forwarded to localhost which typically resolves to 127.0.0.1, could you not have omitted this and just used localhost? this is what we do for all other port forwards

@@ -0,0 +1,111 @@
name: API Server Tests - Postgres
Collaborator:

I would like us to explore whether we can leverage the existing api-server-tests.yml; this would allow us to continue to focus API tests in one workflow. Can we also use the ./.github/actions/test-and-report action like the other jobs in api-server-tests.yml? That way we can continue to have good test reports generated for the PostgreSQL case.

If we need to ignore upgrade tests, we can update the action to take in optional --label-filter values

cc @nsingla

@@ -0,0 +1,94 @@
name: KFP API Integration v1 tests - Postgres
Collaborator:

Similar comments as for api-server-test-Postgres.yml; my preference is to consolidate into the api-server-tests.yml workflow.

cc @nsingla

Contributor Author:

I actually thought about merging the Postgres workflow into the existing API server tests,
but since the matrix strategy there is already quite complex, I kept it separate for now.
Happy to follow whichever direction @HumairAK and @nsingla decide.

kubectl -n kubeflow wait --for condition=Available --timeout=10m deployment/mysql
kubectl -n kubeflow wait --for condition=Available --timeout=3m deployment/metadata-grpc-deployment

.PHONY: dev-kind-cluster-pg
Collaborator:

I think it would be more intuitive for the user if we just updated dev-kind-cluster target to accept a DATABASE env var (kind of like CONTAINER_ENGINE) and allow us to specify mysql/postgres, and we use mysql by default.

We should do the same for kind-cluster-agnostic target, and update the docs accordingly.

So then I just need to do:

# For dev
make -C backend dev-kind-cluster DATABASE=postgres

# For users
make -C backend kind-cluster-agnostic DATABASE=postgres

@kaikaila (Contributor Author) commented Nov 1, 2025:

@HumairAK Thanks for the suggestion! I agree that a dev-kind-cluster target with a DATABASE parameter is better than a separate dev-kind-cluster-pg target.

On the other hand, I'd like to clarify the design intent for kind-cluster-agnostic before adding the parameter.

Since kind-cluster-agnostic appears in frontend/README.md but not in backend/README.md, I assume it's intended as a black-box backend for frontend development.

If that's the case, should we expose backend configuration parameters like DATABASE? Currently we don't expose other backend configurations e.g. object store, cache, MLMD version, etc.

Three Possible Approaches

Option 1: No configuration parameters - ⭐️My favorite

  • Frontend developers just run: make -C backend kind-cluster-agnostic
  • Always uses sensible defaults to simplify user experience.

Option 2: Expose all backend configuration parameters

  • Add DATABASE, OBJECT_STORE, CACHE_BACKEND, etc.
  • If we expose one, we should expose all for consistency.

Option 3: Expose only DATABASE parameter

  • Question: Why is DATABASE special compared to other backend configs?

That said, I'm happy to implement whichever approach you think best serves the user experience. What's your preference?

subsets:
- addresses:
- ip: 172.17.0.1 # docker0 bridge ip
- ip: 192.168.65.254 # host.docker.internal IPv4 (for Mac Docker Desktop)
Collaborator:

what is the reason for this change? doesn't this break it for linux machines? same question for: manifests/kustomize/env/dev-kind-postgresql/forward-local-api-endpoint.yaml

if we want to offer support for docker desktop, then we can split into a separate overlay?

query = query.Where(
sq.Eq{
"pipelines.Namespace": filterContext.ReferenceKey.ID,
fmt.Sprintf("%s.%s", q("pipelines"), q("Namespace")): filterContext.ID,
Collaborator:

did you mean to change filterContext.ReferenceKey.ID -> filterContext.ID ?

opts.SetQuote(s.dbDialect.QuoteIdentifier)
q := s.dbDialect.QuoteIdentifier
qb := s.dbDialect.QueryBuilder()
subQuery := qb.Select("t1.pvid, t1.pid").FromSelect(
Collaborator:

Did we need additional quotes for the identifiers like t1.pvid / rn / etc.?

errorF := func(err error) ([]*model.Run, int, string, error) {
return nil, 0, "", util.NewInternalServerError(err, "Failed to list runs: %v", err)
}
opts.SetQuote(s.dbDialect.QuoteIdentifier)
Collaborator:

it feels strange to set this here, this is something we need to set for opts based on the dialect for every case right? not just runs? maybe this should be set when we first initialize opts, what do you think?

// Add a metric as a new field to the select clause by join the passed-in SQL query with run_metrics table.
// With the metric as a field in the select clause enable sorting on this metric afterwards.
// TODO(jingzhang36): example of resulting SQL query and explanation for it.
func (s *RunStore) addSortByRunMetricToSelect(sqlBuilder sq.SelectBuilder, opts *list.Options) sq.SelectBuilder {
Collaborator:

as per your earlier comment, this is no longer used, so we can remove this yes?

@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from humairak. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

1. dialect.go: fmt.Sprintf for sql string

2. merge review diff from Humair

Signed-off-by: kaikaila <[email protected]>
@kaikaila kaikaila force-pushed the feature/postgres-integration branch from cbac1ef to 35d877a Compare November 1, 2025 05:57