Commit 806e5f5

SEA: Decouple Link Fetching (#632)
All squashed commits below are signed off by varun-edachali-dbx <[email protected]>.

* test getting the list of allowed configurations
* reduce diff
* reduce diff
* house constants in enums for readability and immutability
* add note on hybrid disposition
* [squashed from cloudfetch-sea] introduce external links + arrow functionality
* reduce responsibility of Queue
* reduce repetition in arrow table creation
* reduce redundant code in CloudFetchQueue
* move chunk link progression to separate func
* remove redundant log
* improve logging
* remove reliance on schema_bytes in SEA
* remove redundant note on arrow_schema_bytes
* use more fetch methods
* remove redundant schema_bytes from parent constructor
* only call get_chunk_link with non-null chunk index
* align SeaResultSet structure with ThriftResultSet
* remove _fill_result_buffer from SeaResultSet
* reduce code repetition
* align SeaResultSet with ext-links-sea
* remove redundant methods
* update unit tests
* remove accidental venv changes
* pre-fetch next chunk link on processing current
* reduce nesting
* line break after multi-line pydoc
* re-introduce schema_bytes for better abstraction (likely temporary)
* add fetchmany_arrow and fetchall_arrow
* remove accidental changes in sea backend tests
* remove irrelevant changes
* remove un-necessary test changes
* remove un-necessary changes in thrift backend tests
* remove unimplemented methods test
* remove unimplemented method tests
* modify example scripts to include fetch calls
* add GetChunksResponse
* remove changes to sea test
* re-introduce accidentally removed description extraction method
* fix type errors (ssl_options, CHUNK_PATH_WITH_ID..., etc.)
* access ssl_options through connection
* DEBUG level
* remove explicit multi-chunk test
* move cloud fetch queues back into utils.py
* remove excess docstrings
* move ThriftCloudFetchQueue above SeaCloudFetchQueue
* fix sea connector tests
* correct patch module path in cloud fetch queue tests
* remove unimplemented methods test
* correct add_link docstring
* remove invalid import
* better align queries with JDBC impl
* line breaks after multi-line PRs
* remove unused imports
* fix: introduce ExecuteResponse import
* remove unimplemented metadata methods test, un-necessary imports
* introduce unit tests for metadata methods
* remove verbosity in ResultSetFilter docstring (Co-authored-by: jayant <[email protected]>)
* remove un-necessary info in ResultSetFilter docstring
* remove explicit type checking, string literals around forward annotations
* house SQL commands in constants
* convert complex types to string if not _use_arrow_native_complex_types
* introduce unit tests for altered functionality
* Revert "Merge branch 'fetch-json-inline' into ext-links-sea" (reverts commit dabba55, reversing changes made to dd7dc6a)
* reduce verbosity of ResultSetFilter docstring
* remove unused imports
* Revert "Merge branch 'fetch-json-inline' into ext-links-sea" (reverts commit 3a999c0, reversing changes made to a1f9b9c)
* Revert "reduce verbosity of ResultSetFilter docstring" (reverts commit a1f9b9c)
* Reapply "Merge branch 'fetch-json-inline' into ext-links-sea" (reverts commit 48ad7b3)
* Revert "Merge branch 'fetch-json-inline' into ext-links-sea" (reverts commit dabba55, reversing changes made to dd7dc6a)
* remove un-necessary filters changes
* remove un-necessary backend changes
* remove constants changes
* remove changes in filters tests
* remove unit test backend and JSON queue changes
* remove changes in sea result set testing
* Revert "remove changes in sea result set testing" (reverts commit d210ccd)
* Revert "remove unit test backend and JSON queue changes" (reverts commit f6c5950)
* Revert "remove changes in filters tests" (reverts commit f3f795a)
* Revert "remove constants changes" (reverts commit 802d045)
* Revert "remove un-necessary backend changes" (reverts commit 20822e4)
* Revert "remove un-necessary filters changes" (reverts commit 5e75fb5)
* remove unused imports
* working version
* adopt _wait_until_command_done
* introduce metadata commands
* use new backend structure
* constrain backend diff
* remove changes to filters
* make _parse methods in models internal
* reduce changes in unit tests
* run small queries with SEA during integration tests
* run some tests for sea
* allow empty schema bytes for alignment with SEA
* pass is_vl_op to Sea backend ExecuteResponse
* remove catalog requirement in get_tables
* move filters.py to SEA utils
* ensure SeaResultSet
* prevent circular imports
* remove unused imports
* remove cast, throw error if not SeaResultSet
* pass param as TSparkParameterValue
* remove failing test (temp)
* remove SeaResultSet type assertion
* change errors to align with spec, instead of arbitrary ValueError
* make SEA backend methods return SeaResultSet
* use spec-aligned Exceptions in SEA backend
* remove defensive row type check
* raise ProgrammingError for invalid id
* make is_volume_operation strict bool
* remove complex types code
* Revert "remove complex types code" (reverts commit 138359d)
* introduce type conversion for primitive types for JSON + INLINE
* remove SEA running on metadata queries (known failures)
* remove un-necessary docstrings
* align expected types with databricks sdk
* link rest api reference to validate types
* remove test_catalogs_returns_arrow_table test; metadata commands not expected to pass
* fix fetchall_arrow and fetchmany_arrow
* remove thrift-aligned test_cancel_during_execute from SEA tests
* remove un-necessary changes in example scripts
* remove un-necessary changes in example scripts
* _convert_json_table -> _create_json_table
* remove accidentally removed test
* remove new unit tests (to be re-added based on new arch)
* remove changes in sea_result_set functionality (to be re-added)
* introduce more integration tests
* remove SEA tests in parameterized queries
* remove partial parameter fix changes
* remove un-necessary timestamp tests (pass with minor disparity)
* slightly stronger typing of _convert_json_types
* stronger typing of json utility funcs
* stronger typing of fetch*_json
* remove unused helper methods in SqlType
* line breaks after multi-line pydocs, remove excess logs
* line breaks after multi-line pydocs, reduce diff of redundant changes
* reduce diff of redundant changes
* mandate ResultData in SeaResultSet constructor
* remove complex type conversion
* correct fetch*_arrow
* recover old sea tests
* move queue and result set into SEA-specific dir
* pass ssl_options into CloudFetchQueue
* reduce diff
* remove redundant conversion.py
* fix type issues
* ValueError not ProgrammingError
* reduce diff
* introduce SEA cloudfetch e2e tests
* allow empty cloudfetch result
* add unit tests for CloudFetchQueue and SeaResultSet
* skip pyarrow-dependent tests
* simplify download process: no pre-fetching
* correct class name in logs
* align with old impl
* align next_n_rows with prev impl
* align remaining_rows with prev impl
* remove un-necessary Optional params
* remove un-necessary changes in thrift field if tests
* remove unused imports
* init hybrid
* run large queries
* hybrid disposition
* remove un-necessary log
* formatting (black)
* remove redundant tests
* multi-frame decompression of lz4
* ensure no compression (temp)
* introduce separate link fetcher
* log time to create table
* add chunk index to table creation time log
* remove custom multi-frame decompressor for lz4
* remove excess logs
* remove redundant tests (temp)
* add link to download manager before notifying consumer
* move link fetching immediately before table creation so link expiry is not an issue
* resolve merge artifacts
* remove redundant methods
* formatting (black)
* introduce callback to handle link expiry
* fix types
* fix param type in unit tests
* formatting + minor type fixes
* Revert "introduce callback to handle link expiry" (reverts commit bd51b1c)
* remove unused callback (to be introduced later)
* correct param extraction
* remove common constructor for databricks client abc
* make SEA Http Client instance a private member
* make GetChunksResponse model more robust
* add link to doc of GetChunk response model
* pass result_data instead of "initial links" into SeaCloudFetchQueue
* move download_manager init into parent CloudFetchQueue
* raise ServerOperationError for no 0th chunk
* remove unused imports
* return None in case of empty response
* ensure table is empty on no initial links
* account for total chunk count
* iterate by chunk index instead of link
* make LinkFetcher convert link static
* add helper for link addition, check for edge case to prevent inf wait
* add unit tests for LinkFetcher
* remove un-necessary download manager check
* remove un-necessary string literals around param type
* remove duplicate download_manager init
* account for empty response in LinkFetcher init
* make get_chunk_link return mandatory ExternalLink
* set shutdown_event instead of breaking on completion so get_chunk_link is informed
* docstrings, logging, pydoc
* use total_chunk_count > 0
* clarify that link has already been submitted on getting row_offset
* return None for out of range
* default link_fetcher to None

---------

Signed-off-by: varun-edachali-dbx <[email protected]>
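The decoupling at the heart of this change reduces to one pattern: a producer thread discovers chunk links and caches them under a threading.Condition, while consumers block on that condition until the link for their chunk arrives. Below is a minimal, self-contained sketch of that pattern; ToyLinkFetcher and fake batch names are illustrative stand-ins, not names from this commit.

import threading
from typing import Callable, Dict, Optional

class ToyLinkFetcher:
    """Producer caches chunk links; consumers block until theirs arrives."""

    def __init__(self, fetch_batch: Callable[[int], Dict[int, str]], total_chunk_count: int):
        self._fetch_batch = fetch_batch  # maps a start index to a batch of {chunk_index: link}
        self.total_chunk_count = total_chunk_count
        self._links: Dict[int, str] = {}
        self._cond = threading.Condition()
        self._shutdown = threading.Event()

    def _worker(self):
        next_index = 0
        while not self._shutdown.is_set():
            batch = self._fetch_batch(next_index)
            with self._cond:
                self._links.update(batch)
                self._cond.notify_all()  # wake consumers waiting on these chunks
            next_index = max(self._links) + 1
            if next_index >= self.total_chunk_count:
                # set the event instead of merely breaking, so a blocked
                # get_chunk_link can observe completion and not wait forever
                self._shutdown.set()
        with self._cond:
            self._cond.notify_all()  # final wake-up after shutdown

    def start(self):
        self._thread = threading.Thread(target=self._worker, daemon=True)
        self._thread.start()

    def get_chunk_link(self, chunk_index: int) -> Optional[str]:
        if chunk_index >= self.total_chunk_count:
            return None  # out of range
        with self._cond:
            while chunk_index not in self._links:
                if self._shutdown.is_set():
                    raise RuntimeError(f"no link for chunk {chunk_index}")
                self._cond.wait()
            return self._links[chunk_index]

fetcher = ToyLinkFetcher(
    lambda start: {i: f"https://signed-url/chunk-{i}" for i in (start, start + 1)},
    total_chunk_count=6,
)
fetcher.start()
print(fetcher.get_chunk_link(5))  # blocks until the worker has cached chunk 5

The same handshake appears in the real LinkFetcher in the diff below, with the added wrinkles of error propagation (a stored exception re-raised in the consumer) and conversion of each link to the Thrift shape the download manager expects.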
1 parent 8fbca9d commit 806e5f5

File tree

2 files changed (+371, -89 lines)


src/databricks/sql/backend/sea/queue.py

Lines changed: 201 additions & 58 deletions
@@ -1,7 +1,8 @@
 from __future__ import annotations
 
 from abc import ABC
-from typing import List, Optional, Tuple, Union, TYPE_CHECKING
+import threading
+from typing import Dict, List, Optional, Tuple, Union, TYPE_CHECKING
 
 from databricks.sql.cloudfetch.download_manager import ResultFileDownloadManager
 
@@ -121,6 +122,179 @@ def close(self):
         return
 
 
+class LinkFetcher:
+    """
+    Background helper that incrementally retrieves *external links* for a
+    result set produced by the SEA backend and feeds them to a
+    :class:`databricks.sql.cloudfetch.download_manager.ResultFileDownloadManager`.
+
+    The SEA backend splits large result sets into *chunks*. Each chunk is
+    stored remotely (e.g., in object storage) and exposed via a signed URL
+    encapsulated by an :class:`ExternalLink`. Only the first batch of links is
+    returned with the initial query response. The remaining links must be
+    pulled on demand using the *next-chunk* token embedded in each
+    :pyattr:`ExternalLink.next_chunk_index`.
+
+    LinkFetcher takes care of this choreography so callers (primarily
+    ``SeaCloudFetchQueue``) can simply ask for the link of a specific
+    ``chunk_index`` and block until it becomes available.
+
+    Key responsibilities:
+
+    • Maintain an in-memory mapping from ``chunk_index`` → ``ExternalLink``.
+    • Launch a background worker thread that continuously requests the next
+      batch of links from the backend until all chunks have been discovered or
+      an unrecoverable error occurs.
+    • Bridge SEA link objects to the Thrift representation expected by the
+      existing download manager.
+    • Provide a synchronous API (`get_chunk_link`) that blocks until the desired
+      link is present in the cache.
+    """
+
+    def __init__(
+        self,
+        download_manager: ResultFileDownloadManager,
+        backend: SeaDatabricksClient,
+        statement_id: str,
+        initial_links: List[ExternalLink],
+        total_chunk_count: int,
+    ):
+        self.download_manager = download_manager
+        self.backend = backend
+        self._statement_id = statement_id
+
+        self._shutdown_event = threading.Event()
+
+        self._link_data_update = threading.Condition()
+        self._error: Optional[Exception] = None
+        self.chunk_index_to_link: Dict[int, ExternalLink] = {}
+
+        self._add_links(initial_links)
+        self.total_chunk_count = total_chunk_count
+
+        # DEBUG: capture initial state for observability
+        logger.debug(
+            "LinkFetcher[%s]: initialized with %d initial link(s); expecting %d total chunk(s)",
+            statement_id,
+            len(initial_links),
+            total_chunk_count,
+        )
+
+    def _add_links(self, links: List[ExternalLink]):
+        """Cache *links* locally and enqueue them with the download manager."""
+        logger.debug(
+            "LinkFetcher[%s]: caching %d link(s) – chunks %s",
+            self._statement_id,
+            len(links),
+            ", ".join(str(l.chunk_index) for l in links) if links else "<none>",
+        )
+        for link in links:
+            self.chunk_index_to_link[link.chunk_index] = link
+            self.download_manager.add_link(LinkFetcher._convert_to_thrift_link(link))
+
+    def _get_next_chunk_index(self) -> Optional[int]:
+        """Return the next *chunk_index* that should be requested from the backend, or ``None`` if we have them all."""
+        with self._link_data_update:
+            max_chunk_index = max(self.chunk_index_to_link.keys(), default=None)
+            if max_chunk_index is None:
+                return 0
+            max_link = self.chunk_index_to_link[max_chunk_index]
+            return max_link.next_chunk_index
+
+    def _trigger_next_batch_download(self) -> bool:
+        """Fetch the next batch of links from the backend and return *True* on success."""
+        logger.debug(
+            "LinkFetcher[%s]: requesting next batch of links", self._statement_id
+        )
+        next_chunk_index = self._get_next_chunk_index()
+        if next_chunk_index is None:
+            return False
+
+        try:
+            links = self.backend.get_chunk_links(self._statement_id, next_chunk_index)
+            with self._link_data_update:
+                self._add_links(links)
+                self._link_data_update.notify_all()
+        except Exception as e:
+            logger.error(
+                f"LinkFetcher: Error fetching links for chunk {next_chunk_index}: {e}"
+            )
+            with self._link_data_update:
+                self._error = e
+                self._link_data_update.notify_all()
+            return False
+
+        logger.debug(
+            "LinkFetcher[%s]: received %d new link(s)",
+            self._statement_id,
+            len(links),
+        )
+        return True
+
+    def get_chunk_link(self, chunk_index: int) -> Optional[ExternalLink]:
+        """Return (blocking) the :class:`ExternalLink` associated with *chunk_index*."""
+        logger.debug(
+            "LinkFetcher[%s]: waiting for link of chunk %d",
+            self._statement_id,
+            chunk_index,
+        )
+        if chunk_index >= self.total_chunk_count:
+            return None
+
+        with self._link_data_update:
+            while chunk_index not in self.chunk_index_to_link:
+                if self._error:
+                    raise self._error
+                if self._shutdown_event.is_set():
+                    raise ProgrammingError(
+                        "LinkFetcher is shutting down without providing link for chunk index {}".format(
+                            chunk_index
+                        )
+                    )
+                self._link_data_update.wait()
+
+            return self.chunk_index_to_link[chunk_index]
+
+    @staticmethod
+    def _convert_to_thrift_link(link: ExternalLink) -> TSparkArrowResultLink:
+        """Convert SEA external links to Thrift format for compatibility with existing download manager."""
+        # Parse the ISO format expiration time
+        expiry_time = int(dateutil.parser.parse(link.expiration).timestamp())
+        return TSparkArrowResultLink(
+            fileLink=link.external_link,
+            expiryTime=expiry_time,
+            rowCount=link.row_count,
+            bytesNum=link.byte_count,
+            startRowOffset=link.row_offset,
+            httpHeaders=link.http_headers or {},
+        )
+
+    def _worker_loop(self):
+        """Entry point for the background thread."""
+        logger.debug("LinkFetcher[%s]: worker thread started", self._statement_id)
+        while not self._shutdown_event.is_set():
+            links_downloaded = self._trigger_next_batch_download()
+            if not links_downloaded:
+                self._shutdown_event.set()
+        logger.debug("LinkFetcher[%s]: worker thread exiting", self._statement_id)
+        self._link_data_update.notify_all()
+
+    def start(self):
+        """Spawn the worker thread."""
+        logger.debug("LinkFetcher[%s]: starting worker thread", self._statement_id)
+        self._worker_thread = threading.Thread(
+            target=self._worker_loop, name=f"LinkFetcher-{self._statement_id}"
+        )
+        self._worker_thread.start()
+
+    def stop(self):
+        """Signal the worker thread to stop and wait for its termination."""
+        logger.debug("LinkFetcher[%s]: stopping worker thread", self._statement_id)
+        self._shutdown_event.set()
+        self._worker_thread.join()
+        logger.debug("LinkFetcher[%s]: worker thread stopped", self._statement_id)
+
+
 class SeaCloudFetchQueue(CloudFetchQueue):
     """Queue implementation for EXTERNAL_LINKS disposition with ARROW format for SEA backend."""
 
@@ -158,80 +332,49 @@ def __init__(
             description=description,
         )
 
-        self._sea_client = sea_client
-        self._statement_id = statement_id
-        self._total_chunk_count = total_chunk_count
-
         logger.debug(
             "SeaCloudFetchQueue: Initialize CloudFetch loader for statement {}, total chunks: {}".format(
                 statement_id, total_chunk_count
             )
         )
 
         initial_links = result_data.external_links or []
-        self._chunk_index_to_link = {link.chunk_index: link for link in initial_links}
 
         # Track the current chunk we're processing
         self._current_chunk_index = 0
-        first_link = self._chunk_index_to_link.get(self._current_chunk_index, None)
-        if not first_link:
-            # possibly an empty response
-            return None
 
-        # Track the current chunk we're processing
-        self._current_chunk_index = 0
-        # Initialize table and position
-        self.table = self._create_table_from_link(first_link)
+        self.link_fetcher = None  # for empty responses, we do not need a link fetcher
+        if total_chunk_count > 0:
+            self.link_fetcher = LinkFetcher(
+                download_manager=self.download_manager,
+                backend=sea_client,
+                statement_id=statement_id,
+                initial_links=initial_links,
+                total_chunk_count=total_chunk_count,
+            )
+            self.link_fetcher.start()
 
-    def _convert_to_thrift_link(self, link: ExternalLink) -> TSparkArrowResultLink:
-        """Convert SEA external links to Thrift format for compatibility with existing download manager."""
-        # Parse the ISO format expiration time
-        expiry_time = int(dateutil.parser.parse(link.expiration).timestamp())
-        return TSparkArrowResultLink(
-            fileLink=link.external_link,
-            expiryTime=expiry_time,
-            rowCount=link.row_count,
-            bytesNum=link.byte_count,
-            startRowOffset=link.row_offset,
-            httpHeaders=link.http_headers or {},
-        )
+        # Initialize table and position
+        self.table = self._create_next_table()
 
-    def _get_chunk_link(self, chunk_index: int) -> Optional["ExternalLink"]:
-        if chunk_index >= self._total_chunk_count:
+    def _create_next_table(self) -> Union["pyarrow.Table", None]:
+        """Create next table by retrieving the logical next downloaded file."""
+        if self.link_fetcher is None:
             return None
 
-        if chunk_index not in self._chunk_index_to_link:
-            links = self._sea_client.get_chunk_links(self._statement_id, chunk_index)
-            self._chunk_index_to_link.update({l.chunk_index: l for l in links})
-
-        link = self._chunk_index_to_link.get(chunk_index, None)
-        if not link:
-            raise ServerOperationError(
-                f"Error fetching link for chunk {chunk_index}",
-                {
-                    "operation-id": self._statement_id,
-                    "diagnostic-info": None,
-                },
-            )
-        return link
-
-    def _create_table_from_link(
-        self, link: ExternalLink
-    ) -> Union["pyarrow.Table", None]:
-        """Create a table from a link."""
-
-        thrift_link = self._convert_to_thrift_link(link)
-        self.download_manager.add_link(thrift_link)
+        chunk_link = self.link_fetcher.get_chunk_link(self._current_chunk_index)
+        if chunk_link is None:
+            return None
 
-        row_offset = link.row_offset
+        row_offset = chunk_link.row_offset
+        # NOTE: link has already been submitted to download manager at this point
         arrow_table = self._create_table_at_offset(row_offset)
 
+        self._current_chunk_index += 1
+
         return arrow_table
 
-    def _create_next_table(self) -> Union["pyarrow.Table", None]:
-        """Create next table by retrieving the logical next downloaded file."""
-        self._current_chunk_index += 1
-        next_chunk_link = self._get_chunk_link(self._current_chunk_index)
-        if not next_chunk_link:
-            return None
-        return self._create_table_from_link(next_chunk_link)
+    def close(self):
+        super().close()
+        if self.link_fetcher:
+            self.link_fetcher.stop()
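For completeness, here is a hedged sketch of how the new LinkFetcher could be exercised in isolation with unittest.mock; the PR's actual unit tests are not shown on this page, and make_link with its field values is an assumption based on the ExternalLink attributes that _convert_to_thrift_link reads above.

from unittest.mock import MagicMock

from databricks.sql.backend.sea.queue import LinkFetcher

def make_link(chunk_index, next_chunk_index):
    # Stub carrying only the ExternalLink fields LinkFetcher touches.
    link = MagicMock()
    link.chunk_index = chunk_index
    link.next_chunk_index = next_chunk_index
    link.expiration = "2030-01-01T00:00:00Z"  # parsed by dateutil into an epoch expiryTime
    link.external_link = f"https://example.com/chunk/{chunk_index}"
    link.row_count = 10
    link.byte_count = 100
    link.row_offset = chunk_index * 10
    link.http_headers = {}
    return link

backend = MagicMock()
# Chunk 0 arrives with the query response; the backend serves chunk 1 on demand.
backend.get_chunk_links.return_value = [make_link(1, None)]

fetcher = LinkFetcher(
    download_manager=MagicMock(),
    backend=backend,
    statement_id="stmt-123",
    initial_links=[make_link(0, 1)],
    total_chunk_count=2,
)
fetcher.start()
assert fetcher.get_chunk_link(1).chunk_index == 1  # blocks until the worker fetches it
assert fetcher.get_chunk_link(2) is None           # index beyond total_chunk_count
fetcher.stop()

Note the lifecycle mirrored from SeaCloudFetchQueue.close(): the owner is responsible for calling stop() so the worker thread is joined rather than leaked.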
