Commit f719411

Merge pull request #71 from ImagingDataCommons/reading-from-blobs
Section on reading from blobs with Python
2 parents 4488a06 + cf9b7e5 commit f719411

3 files changed

+326
-0
lines changed

SUMMARY.md

Lines changed: 1 addition & 0 deletions
@@ -37,6 +37,7 @@
 * [Organization of data in v1 (deprecated)](data/organization-of-data/organization-of-data-v1.md)
 * [Downloading data](data/downloading-data/README.md)
 * [Downloading data with s5cmd](data/downloading-data/downloading-data-with-s5cmd.md)
+* [Directly loading DICOM objects from Google Cloud or AWS in Python](data/downloading-data/direct-loading.md)
 * [Data release notes](data/data-release-notes.md)
 * [Data known issues](data/data-known-issues.md)

Lines changed: 325 additions & 0 deletions
@@ -0,0 +1,325 @@
# Directly loading DICOM objects from Google Cloud or AWS in Python

DICOM files in the IDC are stored as "blobs" on the cloud, with one copy housed on Google Cloud Storage (GCS) and another on Amazon Web Services (AWS) S3 storage. By using the right tools, these blobs can be wrapped to appear as "file-like" objects to Python DICOM libraries, enabling intelligent loading of DICOM files directly from cloud storage as if they were local files without having to first download them onto a local drive.

### Reading files with Pydicom

[Pydicom][2] is a popular library for working with DICOM files in Python. Its [dcmread][3] function is able to accept any "file-like" object, meaning you can read a file straight from a cloud blob if you know its path. See [this page](../organization-of-data/files-and-metadata.md#storage-buckets) for information on finding the paths of the blobs for DICOM objects in IDC. The `dcmread` function also has some other options that allow you to control what is read. For example, you can choose to read only the metadata and not the pixel data, or read only certain attributes. In the following two sections, we demonstrate these abilities using first Google Cloud Storage blobs and then AWS S3 blobs.

##### Mapping IDC DICOM series to bucket URLs

All of the image data available from IDC is replicated between public Google Cloud Storage (GCS) and AWS buckets. The pip-installable [idc-index](https://github.com/imagingdatacommons/idc-index) package provides convenience functions to get URLs of the files corresponding to a given DICOM series.

```python
from idc_index import IDCClient

# Create IDCClient for looking up bucket URLs
idc_client = IDCClient()

# Get the list of GCS file URLs in Google bucket from SeriesInstanceUID
gcs_file_urls = idc_client.get_series_file_URLs(
    seriesInstanceUID="1.3.6.1.4.1.14519.5.2.1.131619305319442714547556255525285829796",
    source_bucket_location="gcs",
)

# Get the list of AWS file URLs in AWS bucket from SeriesInstanceUID
aws_file_urls = idc_client.get_series_file_URLs(
    seriesInstanceUID="1.3.6.1.4.1.14519.5.2.1.131619305319442714547556255525285829796",
    source_bucket_location="aws",
)
```
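
The examples on this page separate the bucket name from the blob key by splitting each URL on `/`. As a slightly more defensive alternative, the following sketch uses `urlsplit` from the standard library. The helper name `split_bucket_url` is hypothetical (it is not part of `idc-index`), and the sketch assumes URLs of the form `s3://<bucket>/<folder>/<file>.dcm`, like the example URL shown in the code comments on this page.

```python
from urllib.parse import urlsplit


def split_bucket_url(url: str) -> tuple[str, str]:
    """Split a bucket URL into (bucket_name, blob_key).

    Hypothetical helper for illustration only.
    """
    parts = urlsplit(url)
    blob_key = parts.path.lstrip("/")
    if not parts.netloc or not blob_key:
        raise ValueError(f"Not a bucket URL: {url!r}")
    return parts.netloc, blob_key


# Example URL in the format shown elsewhere on this page
example_url = (
    "s3://idc-open-data/668029cf-41bf-4644-b68a-46b8fa99c3bc/"
    "f4fe9671-0a99-4b6d-9641-d441f13620d4.dcm"
)
bucket_name, blob_key = split_bucket_url(example_url)
print(bucket_name)
print(blob_key)
```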
31+
32+
##### From Google Cloud Storage blobs
33+
34+
The [official Python SDK for Google Cloud Storage][1] (installable from pip and PyPI as `google-cloud-storage`) provides a "file-like" interface allowing other Python libraries, such as Pydicom, to work with blobs as if they were "normal" files on the local filesystem.
35+
36+
To read from a GCS blob with Pydicom, first create a storage client and blob object, representing a remote blob object stored on the cloud, then simply use the `.open('rb')` method to create a readable file-like object that can be passed to the `dcmread` function.

```python
from pydicom import dcmread
from pydicom.datadict import keyword_dict
from google.cloud import storage
from idc_index import IDCClient

# Create IDCClient for looking up bucket URLs
idc_client = IDCClient()

# Create an anonymous client for accessing the IDC public data bucket
gcs_client = storage.Client.create_anonymous_client()

# This example uses a CT series in the IDC.
# Get the list of file URLs in Google bucket from the SeriesInstanceUID
file_urls = idc_client.get_series_file_URLs(
    seriesInstanceUID="1.3.6.1.4.1.14519.5.2.1.131619305319442714547556255525285829796",
    source_bucket_location="gcs",
)

# URLs will look like this:
# s3://idc-open-data/668029cf-41bf-4644-b68a-46b8fa99c3bc/f4fe9671-0a99-4b6d-9641-d441f13620d4.dcm
(_, _, bucket_name, folder_name, file_name) = file_urls[0].split("/")
blob_key = f"{folder_name}/{file_name}"

# These objects represent the bucket and a single image blob within the bucket
bucket = gcs_client.bucket(bucket_name)
blob = bucket.blob(blob_key)

# Read the whole file directly from the blob
with blob.open("rb") as reader:
    dcm = dcmread(reader)

# Read metadata only (no pixel data)
with blob.open("rb") as reader:
    dcm = dcmread(reader, stop_before_pixels=True)

# Read only specific attributes, identified by their tag
# (here the Manufacturer and ManufacturerModelName attributes)
with blob.open("rb") as reader:
    dcm = dcmread(
        reader,
        specific_tags=[keyword_dict['Manufacturer'], keyword_dict['ManufacturerModelName']],
    )
    print(dcm)
```

Reading only metadata or only specific attributes will reduce the amount of data that needs to be pulled down under some circumstances and therefore make the loading process faster. This depends on the size of the attributes being retrieved, the `chunk_size` (a parameter of the `open()` method that controls how much data is pulled in each HTTP request to the server), and the position of the requested element within the file (since it is necessary to seek through the file until the requested attributes are found, but any data after the requested attributes need not be pulled).

This works because running the [open][4] method on a Blob object returns a [BlobReader][5] object, which has a "file-like" interface (specifically the ``seek``, ``read``, and ``tell`` methods).
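
To make the interplay between chunk size and the amount of data pulled more concrete, here is a toy model of a chunked file-like object. It is a minimal sketch and not how `BlobReader` is implemented: the class name `ChunkedReader` is hypothetical, the "remote" data is just an in-memory byte string, and each fetched chunk stands in for one HTTP request.

```python
class ChunkedReader:
    """Toy read-only file-like object that fetches data in fixed-size chunks."""

    def __init__(self, data: bytes, chunk_size: int):
        self._data = data            # stands in for the remote blob
        self._chunk_size = chunk_size
        self._pos = 0
        self._cache = {}             # chunk index -> bytes already fetched
        self.requests = 0            # number of simulated HTTP requests

    def _chunk(self, index: int) -> bytes:
        if index not in self._cache:
            self.requests += 1
            start = index * self._chunk_size
            self._cache[index] = self._data[start:start + self._chunk_size]
        return self._cache[index]

    def read(self, size: int = -1) -> bytes:
        total = len(self._data)
        end = total if size < 0 else min(self._pos + size, total)
        out = bytearray()
        while self._pos < end:
            index, offset = divmod(self._pos, self._chunk_size)
            piece = self._chunk(index)[offset:offset + (end - self._pos)]
            out += piece
            self._pos += len(piece)
        return bytes(out)

    def seek(self, offset: int, whence: int = 0) -> int:
        base = {0: 0, 1: self._pos, 2: len(self._data)}[whence]
        self._pos = base + offset
        return self._pos

    def tell(self) -> int:
        return self._pos


data = bytes(1000)  # pretend this is a 1000-byte file on the cloud

# Reading the first 150 bytes with small chunks: 2 requests, 200 bytes pulled
small = ChunkedReader(data, chunk_size=100)
small.read(150)

# Same read with one large chunk: 1 request, but all 1000 bytes pulled
large = ChunkedReader(data, chunk_size=1000)
large.read(150)

print(small.requests, large.requests)
```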

##### From AWS S3 blobs

The `boto3` package provides a Python API for accessing S3 blobs. It can be installed with `pip install boto3`. In order to access open IDC data without providing AWS credentials, it is necessary to configure your own client object such that it does not require signing. This is demonstrated in the following example, which repeats the above example using the counterpart of the same blob on AWS S3. If you want to read an entire file, we recommend using a temporary buffer like this:

```python
from io import BytesIO
from pydicom import dcmread

import boto3
from botocore import UNSIGNED
from botocore.config import Config
from idc_index import IDCClient

# Create IDCClient for looking up bucket URLs
idc_client = IDCClient()

# This example uses a CT series in the IDC (same as above).
# Get the list of file URLs in AWS bucket from SeriesInstanceUID
file_urls = idc_client.get_series_file_URLs(
    seriesInstanceUID="1.3.6.1.4.1.14519.5.2.1.131619305319442714547556255525285829796",
    source_bucket_location="aws",
)

# URLs will look like this:
# s3://idc-open-data/668029cf-41bf-4644-b68a-46b8fa99c3bc/f4fe9671-0a99-4b6d-9641-d441f13620d4.dcm
(_, _, bucket_name, folder_name, file_name) = file_urls[0].split("/")
blob_key = f"{folder_name}/{file_name}"

# Configure a client to avoid the need for AWS credentials
s3_client = boto3.client('s3', config=Config(signature_version=UNSIGNED))

with BytesIO() as buf:
    # Download entire file contents to an in-memory buffer
    # (bucket_name is "idc-open-data" for the example URL above)
    s3_client.download_fileobj(bucket_name, blob_key, buf)

    # Use pydicom to read from the in-memory buffer
    buf.seek(0)
    dcm = dcmread(buf)
```

Unlike `google-cloud-storage`, `boto3` does not provide a file-like interface to access data in blobs. Instead, the third-party `smart_open` [package][15] wraps an S3 client to expose a "file-like" interface. It can be installed with `pip install 'smart_open[s3]'`. However, we have found that the buffering behavior of this package (which is intended for streaming) is not well matched to the use case of reading DICOM metadata, resulting in many unnecessary requests while reading the metadata of DICOM files (see [this issue](https://github.com/piskvorky/smart_open/issues/712)). Therefore, while the following will work, we recommend using the approach in the above example (downloading the whole file) in most cases, even if you only want to read the metadata, as it will likely be much faster. The exception is when reading only the metadata of very large images where the total amount of pixel data dwarfs the amount of metadata (or when using frame-level access to such images, see below).

```python
from pydicom import dcmread

import boto3
from botocore import UNSIGNED
from botocore.config import Config
import smart_open

from idc_index import IDCClient

# Create IDCClient for looking up bucket URLs
idc_client = IDCClient()

# Get the list of file URLs in AWS bucket from SeriesInstanceUID
file_urls = idc_client.get_series_file_URLs(
    seriesInstanceUID="1.3.6.1.4.1.14519.5.2.1.131619305319442714547556255525285829796",
    source_bucket_location="aws"
)

# URL to an IDC CT image on AWS S3
url = file_urls[0]

# Configure a client to avoid the need for AWS credentials
s3_client = boto3.client('s3', config=Config(signature_version=UNSIGNED))

# Read the whole file directly from the blob
with smart_open.open(url, mode="rb", transport_params=dict(client=s3_client)) as reader:
    dcm = dcmread(reader)

# Read metadata only (no pixel data)
with smart_open.open(url, mode="rb", transport_params=dict(client=s3_client)) as reader:
    dcm = dcmread(reader, stop_before_pixels=True)
```

You may want to look into the other options of `smart_open`'s `open` [function][16] to improve performance (in particular the `buffering` parameter).

In the remainder of the examples, we will use only the GCS access method for brevity. However, you should be able to straightforwardly swap out the opened GCS blob for the opened AWS S3 blob to achieve the same effect with Amazon S3.

### Frame-level access with Highdicom

[Highdicom][6] is a higher-level library providing several features to work with images and image-derived DICOM objects. As of release 0.25.1, its various reading methods (including [imread][7], [segread][8], [annread][9], and [srread][10]) can read any file-like object, including Google Cloud blobs and anything opened with `smart_open` (including S3 blobs).

A particularly useful feature when working with blobs is ["lazy" frame retrieval][13] for images and segmentations. This downloads only the image metadata when the file is initially loaded, uses it to create a frame-level index, and downloads specific frames as and when they are requested by the user. This is especially useful for large multiframe files (such as those found in slide microscopy or multi-segment binary or fractional segmentations) as it can significantly reduce the amount of data that needs to be downloaded to access a subset of the frames.
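
The idea behind a frame-level index can be illustrated with a small standalone sketch. This is a conceptual toy and not highdicom's implementation: the function names `build_frame_index` and `read_frame` are hypothetical, the "file" is an in-memory buffer with a made-up header followed by fixed-size frames, and real DICOM files require parsing the metadata and offset tables to locate frames.

```python
from io import BytesIO


def build_frame_index(frame_lengths, first_frame_offset):
    """Return a list of (byte_offset, length) pairs, one per frame."""
    index = []
    offset = first_frame_offset
    for length in frame_lengths:
        index.append((offset, length))
        offset += length
    return index


def read_frame(reader, index, frame_number):
    """Read a single frame (1-based frame number) using seek and read."""
    offset, length = index[frame_number - 1]
    reader.seek(offset)
    return reader.read(length)


# Synthetic "file": a 100-byte header followed by four 50-byte frames
header = b"H" * 100
frames = [bytes([i]) * 50 for i in range(1, 5)]
fake_file = BytesIO(header + b"".join(frames))

# Build the index once, then fetch only the frame that is needed
index = build_frame_index([len(f) for f in frames], first_frame_offset=len(header))
third_frame = read_frame(fake_file, index, 3)
print(index[2], len(third_frame))
```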

In this first example, we use lazy frame retrieval to load only a specific spatial patch from a large whole slide image from the IDC.

```python
import numpy as np
import highdicom as hd
import matplotlib.pyplot as plt
from google.cloud import storage
from pydicom import dcmread
from pydicom.datadict import keyword_dict

from idc_index import IDCClient

# Create IDCClient for looking up bucket URLs
idc_client = IDCClient()

# Install additional component of idc-index to resolve SM instances to file URLs
idc_client.fetch_index("sm_instance_index")

# Given SeriesInstanceUID of an SM series, find the instance that corresponds to the
# highest resolution base layer of the image pyramid
query = """
SELECT SOPInstanceUID, TotalPixelMatrixColumns
FROM sm_instance_index
WHERE SeriesInstanceUID = '1.3.6.1.4.1.5962.99.1.1900325859.924065538.1719887277027.4.0'
ORDER BY TotalPixelMatrixColumns DESC
LIMIT 1
"""
result = idc_client.sql_query(query)

# Get URL corresponding to the base layer instance in the Google Storage bucket
base_layer_file_url = idc_client.get_instance_file_URL(
    sopInstanceUID=result.iloc[0]["SOPInstanceUID"],
    source_bucket_location="gcs",
)

# Create a storage client and use it to access the IDC's public data bucket
gcs_client = storage.Client.create_anonymous_client()

(_, _, bucket_name, folder_name, file_name) = base_layer_file_url.split("/")
blob_key = f"{folder_name}/{file_name}"

bucket = gcs_client.bucket(bucket_name)
base_layer_blob = bucket.blob(blob_key)

# Read directly from the blob object using lazy frame retrieval
with base_layer_blob.open(mode="rb") as reader:
    im = hd.imread(reader, lazy_frame_retrieval=True)

    # Grab an arbitrary region of the total pixel matrix
    region = im.get_total_pixel_matrix(
        row_start=15000,
        row_end=15512,
        column_start=17000,
        column_end=17512,
        dtype=np.uint8
    )

# Show the region
plt.imshow(region)
plt.show()
```

Running this code should produce an output that looks like this:

<p align="center">
<img src="../../.gitbook/assets/slide_screenshot.png" alt="Screenshot of slide region" width="524" height="454">
</p>

As a further example, we use lazy frame retrieval to load only a specific set of segments from a large multi-organ segmentation of a CT image in the IDC stored in binary format (in binary segmentations, each segment is stored using a separate set of frames).

```python
import highdicom as hd
from google.cloud import storage
from idc_index import IDCClient

# Create IDCClient for looking up bucket URLs
idc_client = IDCClient()

# Get the file URL corresponding to the segmentation of a CT series
# containing a large number of different organs - the same one as used in the
# IDC Portal front page
file_urls = idc_client.get_series_file_URLs(
    seriesInstanceUID="1.2.276.0.7230010.3.1.3.313263360.15787.1706310178.804490",
    source_bucket_location="gcs"
)

(_, _, bucket_name, folder_name, file_name) = file_urls[0].split("/")

# Create a storage client and use it to access the IDC's public data bucket
gcs_client = storage.Client.create_anonymous_client()
bucket = gcs_client.bucket(bucket_name)

blob_name = f"{folder_name}/{file_name}"
blob = bucket.blob(blob_name)

# Open the blob with "segread" using the "lazy frame retrieval" option
with blob.open(mode="rb") as reader:
    seg = hd.seg.segread(reader, lazy_frame_retrieval=True)

    # Find the segment number corresponding to the liver segment
    selected_segment_numbers = seg.get_segment_numbers(segment_label="Liver")

    # Read in the selected segments lazily
    volume = seg.get_volume(
        segment_numbers=selected_segment_numbers,
        combine_segments=True,
    )

# Print dimensions of the liver segment volume
print(volume.shape)
```

See [this][11] page for more information on highdicom's `Image` class, and [this][12] page for the `Segmentation` class.

### The importance of offset tables for slide microscopy (SM) images

Achieving good performance for Slide Microscopy frame-level retrievals requires the presence of either a "Basic Offset Table" or an "Extended Offset Table" in the file. These tables specify the starting position of each frame within the file's byte stream. Without an offset table, libraries such as highdicom have to parse through the pixel data to find the markers that tell them where frame boundaries are, which involves pulling down significantly more data and is therefore very slow. This mostly eliminates the potential speed benefits of frame-level retrieval. Unfortunately, there is no simple way to know whether a file has an offset table without downloading the pixel data and checking it. If you find that an image takes a long time to load initially, it is probably because highdicom is constructing the offset table itself because it wasn't included in the file.

Most IDC images do include an offset table, but some of the older pathology slide images do not. [This page][14] contains some notes about whether individual collections include offset tables.

You can also check whether an image file (including pixel data) has an offset table using pydicom like this:

```python
import pydicom

dcm = pydicom.dcmread("...")  # Any method to read from file/cloud storage

print("Has Extended Offset Table:", "ExtendedOffsetTable" in dcm)
# Bytes 4 to 8 of encapsulated pixel data hold the length of the Basic Offset Table
print("Has Basic Offset Table:", dcm.PixelData[4:8] != b'\x00\x00\x00\x00')
```
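
To show what the byte check above is looking at, the following sketch builds a small piece of synthetic encapsulated pixel data and decodes its Basic Offset Table with the standard library's `struct` module. It assumes the little-endian encapsulated layout defined by the DICOM standard, where the first item holds the table as unsigned 32-bit offsets measured from the start of the first frame item. The helper names `make_item` and `read_basic_offset_table` are hypothetical and the frame contents are made up.

```python
import struct

ITEM_TAG = b"\xfe\xff\x00\xe0"  # Item tag (FFFE,E000) in little-endian byte order


def make_item(payload: bytes) -> bytes:
    """Wrap a payload as an item: tag, 4-byte length, then the payload."""
    return ITEM_TAG + struct.pack("<L", len(payload)) + payload


def read_basic_offset_table(pixel_data: bytes) -> list[int]:
    """Return the frame offsets in the Basic Offset Table (empty list if absent)."""
    if pixel_data[:4] != ITEM_TAG:
        raise ValueError("Encapsulated pixel data must start with an item tag")
    (length,) = struct.unpack("<L", pixel_data[4:8])
    return list(struct.unpack(f"<{length // 4}L", pixel_data[8:8 + length]))


# Two synthetic "compressed frames" of 10 and 6 bytes
frames = [b"\x01" * 10, b"\x02" * 6]
frame_items = b"".join(make_item(f) for f in frames)

# The second frame item starts after the first item's 8-byte header and payload
offsets = [0, 8 + len(frames[0])]

with_table = make_item(struct.pack("<2L", *offsets)) + frame_items
without_table = make_item(b"") + frame_items

print(read_basic_offset_table(with_table))
print(read_basic_offset_table(without_table))
```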

[1]: https://cloud.google.com/python/docs/reference/storage/latest/
[2]: https://pydicom.github.io/pydicom/stable/index.html
[3]: https://pydicom.github.io/pydicom/stable/reference/generated/pydicom.filereader.dcmread.html#pydicom.filereader.dcmread
[4]: https://cloud.google.com/python/docs/reference/storage/latest/google.cloud.storage.blob.Blob#google_cloud_storage_blob_Blob_open
[5]: https://cloud.google.com/python/docs/reference/storage/latest/google.cloud.storage.fileio.BlobReader
[6]: https://highdicom.readthedocs.io
[7]: https://highdicom.readthedocs.io/en/latest/package.html#highdicom.imread
[8]: https://highdicom.readthedocs.io/en/latest/package.html#highdicom.seg.segread
[9]: https://highdicom.readthedocs.io/en/latest/package.html#highdicom.ann.annread
[10]: https://highdicom.readthedocs.io/en/latest/package.html#highdicom.sr.srread
[11]: https://highdicom.readthedocs.io/en/latest/image.html
[12]: https://highdicom.readthedocs.io/en/latest/seg.html
[13]: https://highdicom.readthedocs.io/en/latest/image.html#lazy
[14]: https://github.com/ImagingDataCommons/idc-wsi-conversion?tab=readme-ov-file#overview
[15]: https://github.com/piskvorky/smart_open
[16]: https://github.com/piskvorky/smart_open/blob/master/help.txt
