Commit 128aefe

Add vector embedding UDF using Qwen3 model
1 parent 4d8973c commit 128aefe

24 files changed: 690 additions and 110 deletions

.github/workflows/ci.yml

Lines changed: 26 additions & 0 deletions
@@ -0,0 +1,26 @@
+name: CI
+
+on:
+  push:
+    branches: ["main"]
+  pull_request:
+
+jobs:
+  test:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+
+      - name: Set up Python
+        uses: actions/setup-python@v5
+        with:
+          python-version: "3.11"
+
+      - name: Install uv
+        run: pip install uv
+
+      - name: Run tests
+        env:
+          HF_HOME: ${{ github.workspace }}/.hf-cache
+          TRANSFORMERS_CACHE: ${{ github.workspace }}/.hf-cache
+        run: uv run pytest -m "not slow"

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -2,3 +2,4 @@
 __pycache__/
 .pytest_cache/
 *.egg-info/
+.hf-cache/

README.md

Lines changed: 34 additions & 28 deletions
@@ -1,41 +1,47 @@
-# ai-server
+# databend-aiserver
 
-Stage aware Databend UDF server providing storage centric helpers used by the AI
-extensions. The initial release ships with:
+Databend AI Server exposes UDFs that let Databend SQL query object storage
+and call AI/vector helpers inline.
 
-- `list_stage_files`: enumerate objects inside an external S3 stage via
-  [Apache OpenDAL](https://opendal.apache.org/docs/python).
-- `read_pdf`: pull down a PDF object from a stage and return its extracted text
-  as a `STRING`.
-- `read_docx`: fetch Microsoft Word files (`.docx`) and expose their textual
-  content as a `STRING`.
+## Available UDFs (prefix `aiserver_`)
+- `list_stage_files(stage, limit)` → list external-stage objects (`VARIANT`).
+- `read_pdf(stage, path)` → fetch PDF text (`STRING`).
+- `read_docx(stage, path)` → fetch DOCX text (`STRING`).
+- `vector_embed_text_1024(model, text)` → 1024-dim embeddings; batch-friendly (`ARRAY(FLOAT NULL)`).
 
-## Getting started
+Currently supported embedding alias: `qwen` (maps to Hugging Face
+`Qwen/Qwen3-Embedding-0.6B`). Models cache under `.hf-cache/` and run on
+GPU/MPS/CPU automatically.
+
+## Quickstart
 
 ```bash
-python3 -m venv .venv
-source .venv/bin/activate
-pip install -e .[dev]
-ai-udf-server --port 8815 --metrics-port 9091
+uv sync
+uv run databend-aiserver --port 8815
 ```
 
-Databend can now connect to the running Flight server and call the registered
-functions. `list_stage_files` expects a `STAGE_LOCATION` argument supplied by
-Databend plus a numeric limit. Example:
+## Example SQL workflow
 
 ```sql
-SELECT
-  list_stage_files(
-    @stage_location,
-    50
-  );
+-- Inspect files living in an external stage.
+SELECT *
+FROM aiserver_list_stage_files(@docs_stage, 20);
+
+-- Pull a document into SQL text processing.
+WITH pdf AS (
+  SELECT aiserver_read_pdf(@docs_stage, 'reports/2025/q1.pdf') AS body
+)
+SELECT * FROM pdf;
+
+-- Pipe DOCX text into embeddings (batch via table column).
+SELECT id,
+       aiserver_vector_embed_text_1024('qwen', doc_body) AS embedding
+FROM company_docs;
 ```
 
-To retrieve file contents call the document readers with an explicit path. The
-PDF reader optionally accepts `NULL` for the `max_pages` argument to stream the
-entire document:
+## Tests
 
-```sql
-SELECT read_pdf(@stage_location, 'inbox/manual.pdf', NULL);
-SELECT read_docx(@stage_location, 'reports/summary.docx');
+```bash
+uv run pytest # runs slow embedding tests too
+uv run pytest -m "not slow" # skip Hugging Face downloads
 ```
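
Reviewer sketch: the README entry for `vector_embed_text_1024(model, text)` describes a batch-friendly UDF that returns 1024-dimension vectors and accepts the alias `qwen`. The UDF source is not among the hunks shown here, so the following is only a minimal illustration of that contract under stated assumptions. `MODEL_ALIASES`, `resolve_model`, `embed_batch`, and the hash-based `fake_encoder` are hypothetical names, NULL pass-through is an assumed behavior, and the real implementation presumably calls the Hugging Face model instead of a stub.

```python
from __future__ import annotations

import hashlib
from typing import Callable, List, Optional, Sequence

EMBED_DIM = 1024  # dimension advertised by the UDF name

# Alias mapping as stated in the README; the dict itself is a hypothetical structure.
MODEL_ALIASES = {"qwen": "Qwen/Qwen3-Embedding-0.6B"}

Vector = List[float]


def resolve_model(alias: str) -> str:
    """Map a user-facing alias to a Hugging Face model id, or raise ValueError."""
    try:
        return MODEL_ALIASES[alias.strip().lower()]
    except KeyError:
        raise ValueError(f"unsupported embedding model alias: {alias!r}") from None


def fake_encoder(texts: Sequence[str]) -> List[Vector]:
    """Deterministic stand-in for a real encoder (NOT a semantic embedding)."""
    vectors = []
    for text in texts:
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        # Cycle through the 32 digest bytes to fill 1024 slots, scaled into [0, 1].
        vectors.append([digest[i % len(digest)] / 255.0 for i in range(EMBED_DIM)])
    return vectors


def embed_batch(
    alias: str,
    texts: Sequence[Optional[str]],
    encoder: Callable[[Sequence[str]], List[Vector]] = fake_encoder,
) -> List[Optional[Vector]]:
    """Embed one column of texts in a single encoder call.

    NULL inputs (None) yield NULL outputs; this is an assumption, not
    something the diff confirms.
    """
    resolve_model(alias)  # validate the alias before doing any work
    present = [(i, t) for i, t in enumerate(texts) if t is not None]
    out: List[Optional[Vector]] = [None] * len(texts)
    if present:
        vectors = encoder([t for _, t in present])
        for (i, _), vec in zip(present, vectors):
            if len(vec) != EMBED_DIM:
                raise ValueError(f"expected {EMBED_DIM} dims, got {len(vec)}")
            out[i] = vec
    return out
```

Encoding the non-NULL rows in one call is what would make such a function batch-friendly: the model runs once per batch of rows rather than once per row.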

ai_server/__init__.py

Lines changed: 0 additions & 5 deletions
This file was deleted.

ai_server/server.py

Lines changed: 0 additions & 36 deletions
This file was deleted.

ai_server/udfs/__init__.py

Lines changed: 0 additions & 6 deletions
This file was deleted.

databend_aiserver/__init__.py

Lines changed: 19 additions & 0 deletions
@@ -0,0 +1,19 @@
+# Copyright 2025 Databend Labs
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""AI-enhanced Databend UDF server utilities."""
+
+from .server import create_server
+
+__all__ = ["create_server"]

ai_server/main.py renamed to databend_aiserver/main.py

Lines changed: 15 additions & 1 deletion
@@ -1,3 +1,17 @@
+# Copyright 2025 Databend Labs
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
 """Command line entrypoint for the Databend AI UDF server."""
 
 from __future__ import annotations
@@ -11,7 +25,7 @@
 
 from prometheus_client import start_http_server as start_prometheus_server
 
-from ai_server.server import create_server
+from databend_aiserver.server import create_server
 
 logger = logging.getLogger(__name__)

databend_aiserver/server.py

Lines changed: 56 additions & 0 deletions
@@ -0,0 +1,56 @@
+# Copyright 2025 Databend Labs
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Entrypoint for the Databend AI UDF server."""
+
+from __future__ import annotations
+
+from typing import Optional
+
+from databend_udf import UDFServer
+
+from databend_aiserver.udfs import (
+    aiserver_list_stage_files,
+    aiserver_read_docx,
+    aiserver_read_pdf,
+    aiserver_vector_embed_text_1024,
+)
+
+
+def create_server(
+    host: str = "0.0.0.0", port: int = 8815, metric_port: Optional[int] = None
+) -> UDFServer:
+    """
+    Create a configured UDF server instance.
+
+    Parameters
+    ----------
+    host:
+        Bind address for the Flight server.
+    port:
+        Bind port for the Flight server.
+    metric_port:
+        Optional metrics port for Prometheus exporter. When provided the
+        databend-udf server will expose metrics via Prometheus.
+    """
+    location = f"{host}:{port}"
+    metric_location = (
+        f"{host}:{metric_port}" if metric_port is not None else None
+    )
+    server = UDFServer(location, metric_location=metric_location)
+    server.add_function(aiserver_list_stage_files)
+    server.add_function(aiserver_read_pdf)
+    server.add_function(aiserver_read_docx)
+    server.add_function(aiserver_vector_embed_text_1024)
+    return server
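
Reviewer sketch: `create_server` derives two bind strings from its arguments before handing them to `UDFServer`. The snippet below isolates that address logic so it can be checked without `databend_udf` installed. `build_locations` is a hypothetical helper name and not part of the commit; its defaults simply mirror those in the hunk above.

```python
from __future__ import annotations

from typing import Optional, Tuple


def build_locations(
    host: str = "0.0.0.0", port: int = 8815, metric_port: Optional[int] = None
) -> Tuple[str, Optional[str]]:
    """Return (flight_location, metric_location) the way create_server builds them."""
    location = f"{host}:{port}"
    # `is not None` rather than truthiness: a metric_port of 0 still yields
    # an address string instead of silently disabling metrics.
    metric_location = f"{host}:{metric_port}" if metric_port is not None else None
    return location, metric_location


if __name__ == "__main__":
    print(build_locations())                           # metrics disabled by default
    print(build_locations("127.0.0.1", 8815, 9091))    # metrics on the same host
```

Both addresses share one `host`, so with these arguments the metrics endpoint cannot be bound to a different interface than the Flight server.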

ai_server/stages/operator.py renamed to databend_aiserver/stages/operator.py

Lines changed: 14 additions & 0 deletions
@@ -1,3 +1,17 @@
+# Copyright 2025 Databend Labs
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
 """Helpers to translate Databend stage metadata into OpenDAL operators."""
 
 from __future__ import annotations
