
Commit 59166d2

refinements to direct access page

* added section on getting file URLs using idc-index
* updated all examples with the details on getting bucket file URLs using idc-index
* replaced direct use of tag numbers with keyword lookup
* updated importance of offset tables to clarify it is about SM

This notebook contains all of the code samples for convenient testing - I plan to update this notebook and add it to IDC-Tutorials.

1 parent e136261 commit 59166d2

File tree

1 file changed: data/downloading-data/direct-loading.md (121 additions, 24 deletions)
@@ -5,6 +5,25 @@ DICOM files in the IDC are stored as "blobs" on the cloud, with one copy housed
 
 [Pydicom][2] is a popular library for working with DICOM files in Python. Its [dcmread][3] function is able to accept any "file-like" object, meaning you can read a file straight from a cloud blob if you know its path. See [this page](../organization-of-data/files-and-metadata.md#storage-buckets) for information on finding the paths of the blobs for DICOM objects in IDC. The `dcmread` function also has other options that allow you to control what is read. For example, you can choose to read only the metadata and not the pixel data, or read only certain attributes. In the following two sections, we demonstrate these abilities using first Google Cloud Storage blobs and then AWS S3 blobs.
 
+##### Mapping IDC DICOM series to bucket URLs
+
+All of the image data available from IDC is replicated between public Google Cloud Storage (GCS) and AWS buckets. The pip-installable [idc-index](https://github.com/imagingdatacommons/idc-index) package provides convenience functions to get the URLs of the files corresponding to a given DICOM series.
+
+```python
+from idc_index import IDCClient
+
+# create IDCClient() for looking up bucket URLs
+idc_client = IDCClient()
+
+# get the list of file URLs in the GCS bucket from the SeriesInstanceUID
+gcs_file_urls = idc_client.get_series_file_URLs(seriesInstanceUID="1.3.6.1.4.1.14519.5.2.1.131619305319442714547556255525285829796",
+                                                source_bucket_location="gcs")
+
+# get the list of file URLs in the AWS bucket from the SeriesInstanceUID
+aws_file_urls = idc_client.get_series_file_URLs(seriesInstanceUID="1.3.6.1.4.1.14519.5.2.1.131619305319442714547556255525285829796",
+                                                source_bucket_location="aws")
+```
 
 ##### From Google Cloud Storage blobs
 
 The [official Python SDK for Google Cloud Storage][1] (installable from pip and PyPI as `google-cloud-storage`) provides a "file-like" interface allowing other Python libraries, such as Pydicom, to work with blobs as if they were "normal" files on the local filesystem.
@@ -15,15 +34,29 @@ To read from a GCS blob with Pydicom, first create a storage client and blob obj
 from pydicom import dcmread
 from google.cloud import storage
 
+from pydicom.datadict import keyword_dict
+
+from idc_index import IDCClient
+
+# create IDCClient() for looking up bucket URLs
+idc_client = IDCClient()
 
 # Create a client and bucket object representing the IDC public data bucket
 client = storage.Client.create_anonymous_client()
-bucket = client.bucket("idc-open-data")
+
+# get the list of file URLs in the Google bucket from the SeriesInstanceUID
+file_urls = idc_client.get_series_file_URLs(seriesInstanceUID="1.3.6.1.4.1.14519.5.2.1.131619305319442714547556255525285829796",
+                                            source_bucket_location="gcs")
+
+# URLs will look like this:
+# s3://idc-open-data/668029cf-41bf-4644-b68a-46b8fa99c3bc/f4fe9671-0a99-4b6d-9641-d441f13620d4.dcm
+(_, _, bucket_name, folder_name, file_name) = file_urls[0].split("/")
+blob_key = f"{folder_name}/{file_name}"
+
+bucket = client.bucket(bucket_name)
 
 # This is the path (within the above bucket) to a CT image in the IDC
-blob = bucket.blob(
-    "f44633af-5e76-4e01-a7fe-63764fc7e8c2/e36b336b-3550-48c9-8457-c853eab14e25.dcm"
-)
+blob = bucket.blob(blob_key)
 
 # Read the whole file directly from the blob
 with blob.open("rb") as reader:
@@ -36,7 +69,9 @@ with blob.open("rb") as reader:
 # Read only specific attributes, identified by their tag
 # (here the Manufacturer and ManufacturerModelName attributes)
 with blob.open("rb") as reader:
-    dcm = dcmread(reader, specific_tags=[0x0008_0070, 0x0008_1090])
+    dcm = dcmread(reader, specific_tags=[keyword_dict['Manufacturer'],
+                                         keyword_dict['ManufacturerModelName']])
+    print(dcm)
 ```
 
 Reading only metadata or only specific attributes will reduce the amount of data that needs to be pulled down under some circumstances and therefore make the loading process faster. This depends on the size of the attributes being retrieved, the `chunk_size` (a parameter of the `open()` method that controls how much data is pulled in each HTTP request to the server), and the position of the requested element within the file (since it is necessary to seek through the file until the requested attributes are found, but any data after the requested attributes need not be pulled).
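
For example, a metadata-only read that tunes `chunk_size` might look like the following sketch (reusing the `blob` object from the example above; the 1 MiB value is an arbitrary illustrative choice, not a measured recommendation):

```python
from pydicom import dcmread

# Pull 1 MiB per HTTP range request while seeking through the file
with blob.open("rb", chunk_size=1024 * 1024) as reader:
    # Stop parsing when the (typically large) PixelData element is reached
    dcm = dcmread(reader, stop_before_pixels=True)

print(dcm.SOPInstanceUID)
```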
@@ -48,17 +83,28 @@ This works because running the [open][4] method on a Blob object returns a [Blob
 The `boto3` package provides a Python API for accessing S3 blobs. It can be installed with `pip install boto3`. In order to access open IDC data without providing AWS credentials, it is necessary to configure your own client object such that it does not require signing. This is demonstrated in the following example, which repeats the above example using the counterpart of the same blob on AWS S3. If you want to read an entire file, we recommend using a temporary buffer like this:
 
 ```python
-
 from io import BytesIO
 from pydicom import dcmread
 
 import boto3
 from botocore import UNSIGNED
 from botocore.config import Config
 
+from idc_index import IDCClient
+
+# create IDCClient() for looking up bucket URLs
+idc_client = IDCClient()
+
+# get the list of file URLs in the AWS bucket from the SeriesInstanceUID
+file_urls = idc_client.get_series_file_URLs(seriesInstanceUID="1.3.6.1.4.1.14519.5.2.1.131619305319442714547556255525285829796",
+                                            source_bucket_location="aws")
+
+# URLs will look like this:
+# s3://idc-open-data/668029cf-41bf-4644-b68a-46b8fa99c3bc/f4fe9671-0a99-4b6d-9641-d441f13620d4.dcm
+(_, _, bucket_name, folder_name, file_name) = file_urls[0].split("/")
+blob_key = f"{folder_name}/{file_name}"
 
-# Path (within the idc-open-data bucket) to an IDC CT image on AWS S3
-blob_key = "f44633af-5e76-4e01-a7fe-63764fc7e8c2/e36b336b-3550-48c9-8457-c853eab14e25.dcm"
 
 # Configure a client to avoid the need for AWS credentials
 s3_client = boto3.client('s3', config=Config(signature_version=UNSIGNED))
@@ -70,7 +116,6 @@ with BytesIO() as buf:
     # Use pydicom to read from the in-memory buffer
     buf.seek(0)
    dcm = dcmread(buf)
-
 ```
 
 Unlike `google-cloud-storage`, `boto3` does not provide a file-like interface to access data in blobs. Instead, the third-party `smart_open` [package][15] wraps an S3 client to expose a "file-like" interface. It can be installed with `pip install 'smart_open[s3]'`. However, we have found that the buffering behavior of this package (which is intended for streaming) is not well matched to the use case of reading DICOM metadata, resulting in many unnecessary requests while reading the metadata of DICOM files (see [this](https://github.com/piskvorky/smart_open/issues/712) issue). Therefore, while the following will work, we recommend the approach in the example above (downloading the whole file) in most cases, even if you only want to read the metadata, as it will likely be much faster. The exception is when reading only the metadata of very large images where the total amount of pixel data dwarfs the amount of metadata (or when using frame-level access to such images; see below).
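
For metadata-only reads via `boto3`, the recommended whole-file route can be sketched as follows (assuming the `s3_client`, `bucket_name`, and `blob_key` variables defined in the example above):

```python
from io import BytesIO
from pydicom import dcmread

# Download the whole object in a single request into an in-memory buffer
with BytesIO() as buf:
    s3_client.download_fileobj(bucket_name, blob_key, buf)

    # Parse only the metadata from the buffer
    buf.seek(0)
    dcm = dcmread(buf, stop_before_pixels=True)
```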
@@ -83,12 +128,25 @@ from botocore import UNSIGNED
 from botocore.config import Config
 import smart_open
 
+from idc_index import IDCClient
+
+# create IDCClient() for looking up bucket URLs
+idc_client = IDCClient()
+
+# get the list of file URLs in the AWS bucket from the SeriesInstanceUID
+file_urls = idc_client.get_series_file_URLs(seriesInstanceUID="1.3.6.1.4.1.14519.5.2.1.131619305319442714547556255525285829796",
+                                            source_bucket_location="aws")
+
+# URL to an IDC CT image on AWS S3
+url = file_urls[0]
 
 # Configure a client to avoid the need for AWS credentials
 s3_client = boto3.client('s3', config=Config(signature_version=UNSIGNED))
 
-# URL to an IDC CT image on AWS S3
-url = 's3://idc-open-data/f44633af-5e76-4e01-a7fe-63764fc7e8c2/e36b336b-3550-48c9-8457-c853eab14e25.dcm'
+# Read the whole file directly from the blob
+dcm = dcmread(
+    smart_open.open(url, mode="rb", transport_params=dict(client=s3_client)),
+)
 
 # Read the whole file directly from the blob
 with smart_open.open(url, mode="rb", transport_params=dict(client=s3_client)) as reader:
@@ -116,20 +174,46 @@ import numpy as np
 import highdicom as hd
 import matplotlib.pyplot as plt
 from google.cloud import storage
+from pydicom import dcmread
+
+from pydicom.datadict import keyword_dict
 
+from idc_index import IDCClient
+
+# create IDCClient() for looking up bucket URLs
+idc_client = IDCClient()
+
+# get the list of file URLs in the Google bucket from the SeriesInstanceUID
+# in this case we are using a series from the IDC CCDI-MCI collection
+file_urls = idc_client.get_series_file_URLs(seriesInstanceUID="1.3.6.1.4.1.5962.99.1.1900325859.924065538.1719887277027.4.0",
+                                            source_bucket_location="gcs")
+
+(_, _, bucket_name, folder_name, file_name) = file_urls[0].split("/")
 
 # Create a storage client and use it to access the IDC's public data bucket
 client = storage.Client.create_anonymous_client()
-bucket = client.bucket("idc-open-data")
+bucket = client.bucket(bucket_name)
+
+# go over the series instances to find the base (largest matrix) layer
+# based on the TotalPixelMatrixColumns value
+largest_dimension = 0
+base_layer_blob = None
+for instance_file_url in file_urls:
+    (_, _, _, folder_name, file_name) = instance_file_url.split("/")
+    blob_name = f"{folder_name}/{file_name}"
+
+    blob = bucket.blob(blob_name)
+
+    with blob.open("rb") as reader:
+        dcm = dcmread(reader, specific_tags=[keyword_dict['TotalPixelMatrixColumns']])
+        total_columns = dcm.TotalPixelMatrixColumns
+        if total_columns > largest_dimension:
+            largest_dimension = total_columns
+            base_layer_blob = blob
 
-# This is the path (within the above bucket) to a whole slide image from the
-# IDC collection called "CCDI MCI"
-blob = bucket.blob(
-    "763fe058-7d25-4ba7-9b29-fd3d6c41dc4b/210f0529-c767-4795-9acf-bad2f4877427.dcm"
-)
 
 # Read directly from the blob object using lazy frame retrieval
-with blob.open(mode="rb") as reader:
+with base_layer_blob.open(mode="rb") as reader:
     im = hd.imread(reader, lazy_frame_retrieval=True)
 
 # Grab an arbitrary region of the total pixel matrix
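
The region extraction itself falls outside this hunk's context; as a rough sketch of what such a call can look like (the `get_total_pixel_matrix` method is part of highdicom's `Image` API, and the pixel coordinates here are arbitrary assumptions, not lines from this file):

```python
# With lazy frame retrieval, only the frames overlapping the requested
# sub-region of the total pixel matrix are pulled from the blob
region = im.get_total_pixel_matrix(
    row_start=10000,
    row_end=10512,
    column_start=10000,
    column_end=10512,
)
print(region.shape)
```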
@@ -159,15 +243,25 @@ As a further example, we use lazy frame retrieval to load only a specific set of
 import highdicom as hd
 from google.cloud import storage
 
+# create IDCClient() for looking up bucket URLs
+from idc_index import IDCClient
+idc_client = IDCClient()
+
+# Get the file URL corresponding to the segmentation of a CT series
+# containing a large number of different organs - the same one as used on the
+# IDC Portal front page
+file_urls = idc_client.get_series_file_URLs(seriesInstanceUID="1.2.276.0.7230010.3.1.3.313263360.15787.1706310178.804490",
+                                            source_bucket_location="gcs")
+
+(_, _, bucket_name, folder_name, file_name) = file_urls[0].split("/")
 
 # Create a storage client and use it to access the IDC's public data bucket
 client = storage.Client.create_anonymous_client()
-bucket = client.bucket("idc-open-data")
+bucket = client.bucket(bucket_name)
 
-# This is the path (within the above bucket) to a segmentation of a CT series
-# containing a large number of different organs
+blob_name = f"{folder_name}/{file_name}"
 blob = bucket.blob(
-    "3f38511f-fd09-4e2f-89ba-bc0845fe0005/c8ea3be0-15d7-4a04-842d-00b183f53b56.dcm"
+    blob_name
 )
 
 # Open the blob with "segread" using the "lazy frame retrieval" option
@@ -182,13 +276,16 @@ with blob.open(mode="rb") as reader:
         segment_numbers=selected_segment_numbers,
         combine_segments=True,
     )
+
+# print the dimensions of the liver segment volume
+print(volume.shape)
 ```
 
 See [this][11] page for more information on highdicom's `Image` class, and [this][12] page for the `Segmentation` class.
 
-### The importance of offset tables
+### The importance of offset tables for SM modality
 
-Achieving good performance for these frame-level retrievals requires the presence of a "Basic Offset Table" or "Extended Offset Table" in the file. These tables specify the starting positions of each frame within the file's byte stream. Without an offset table being present, libraries such as highdicom have to parse through the pixel data to find markers that tell it where frame boundaries are, which involves pulling down significantly more data and is therefore very slow. This mostly eliminates the potential speed benefits of frame-level retrieval. Unfortunately there is no simple way to know whether a file has an offset table without downloading the pixel data and checking it. If you find that an image takes a long time to load initially, it is probably because highdicom is constructing the offset table itself because it wasn't included in the file.
+Achieving good performance for Slide Microscopy (SM) frame-level retrievals requires the presence of a "Basic Offset Table" or "Extended Offset Table" in the file. These tables specify the starting position of each frame within the file's byte stream. Without an offset table present, libraries such as highdicom have to parse through the pixel data to find the markers that indicate where frame boundaries are, which involves pulling down significantly more data and is therefore very slow. This mostly eliminates the potential speed benefits of frame-level retrieval. Unfortunately, there is no simple way to know whether a file has an offset table without downloading the pixel data and checking it. If you find that an image takes a long time to load initially, it is probably because highdicom is constructing the offset table itself because it wasn't included in the file.
 
 Most IDC images do include an offset table, but some of the older pathology slide images do not. [This page][14] contains some notes about whether individual collections include offset tables.
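
Once a file's bytes are available locally, a quick check for offset tables is possible. A sketch (interpreting the first item of the encapsulated Pixel Data as the Basic Offset Table, per DICOM PS3.5; `dcm` is assumed to be a dataset read with `dcmread`, including its pixel data):

```python
import struct

# The Extended Offset Table, when present, is an ordinary attribute
has_eot = "ExtendedOffsetTable" in dcm

# Encapsulated Pixel Data starts with an item (tag FFFE,E000) holding the
# Basic Offset Table; a zero-length first item means no BOT was written
bot_length = struct.unpack("<L", dcm.PixelData[4:8])[0]
has_bot = bot_length > 0

print(f"Extended Offset Table: {has_eot}, Basic Offset Table: {has_bot}")
```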
