|
1 | 1 | # Introduction |
2 | 2 |
|
3 | | -{% hint style="info" %} |
4 | | -Check out the [IDC Getting started tutorial](https://github.com/ImagingDataCommons/IDC-Examples/tree/master/notebooks/getting\_started) for a quick introduction into data organization and main features of our repository! |
| 3 | +## Data sources |
5 | 4 |
|
6 | | -IDC data is replicated as [a public dataset in the Google Marketplace](https://console.cloud.google.com/marketplace/product/bigquery-public-data/nci-idc-data). You can see the summary dashboard of the dataset [here](https://datastudio.google.com/u/0/reporting/04cf5976-4ea0-4fee-a749-8bfd162f2e87). |
7 | | -{% endhint %} |
| 5 | +Most of the data in IDC comes from data collection initiatives and projects supported by the US National Cancer Institute. Whenever source images or image-derived data are not in DICOM format, they are harmonized into DICOM as part of ingestion.
| 6 | + |
| 7 | +IDC sources of data include: |
| 8 | + |
| 9 | +* [The Cancer Imaging Archive (TCIA) (ongoing)](https://www.cancerimagingarchive.net/) |
| 10 | + * all DICOM files from the public collections are mirrored in IDC |
| 11 | + * a subset of digital pathology collections and analysis results, harmonized from the vendor-specific representations available from TCIA into DICOM Slide Microscopy (SM) format
| 12 | +* [Childhood Cancer Data Initiative (CCDI) (ongoing)](https://www.cancer.gov/research/areas/childhood/childhood-cancer-data-initiative) |
| 13 | + * digital pathology slides harmonized into DICOM SM |
| 14 | +* [Genomic Data Commons (GDC)](https://portal.gdc.cancer.gov/) |
| 15 | + * The Cancer Genome Atlas (TCGA) slides harmonized into DICOM SM |
| 16 | +* [Human Tumor Atlas Network (HTAN)](https://humantumoratlas.org/) |
| 17 | + * release 1 of the HTAN data harmonized into DICOM SM |
| 18 | +* [National Library of Medicine Visible Human Project](https://www.nlm.nih.gov/research/visible/visible_human.html) |
| 19 | + * v1 of the Visible Human images harmonized into DICOM MR/CT/XC |
| 20 | +* [Genotype-Tissue Expression Project (GTEx)](https://commonfund.nih.gov/GTEx)
| 21 | + * digital pathology slides harmonized into DICOM SM |
| 22 | + |
| 23 | +## Data provenance |
8 | 24 |
|
9 | | -Currently, IDC is hosting data from the following data repositories: |
| 25 | +Whenever IDC replicates data from a publicly available source, we include a reference to its origin. You can find this reference in two places:
10 | 26 |
|
11 | | -* publicly available radiology collections and analysis results collections (in DICOM format) from The Cancer Imaging Archive (TCIA) |
12 | | -* whole slide pathology images (in [DICOM-TIFF format](../dicom/dicom-tiff-dual-personality-files.md)) collected by |
13 | | - * The Cancer Genome Atlas (TCGA) |
14 | | - * [Clinical Proteomic Tumor Analysis Consortium (CPTAC)](https://proteomics.cancer.gov/programs/cptac) |
15 | | - * [National Lung Screening Trial (NLST)](https://www.cancer.gov/types/lung/research/nlst) |
16 | | -* fluorescence images (in [DICOM-TIFF format](../dicom/dicom-tiff-dual-personality-files.md)) collected by the [Human Tumor Atlas Network (HTAN)](https://humantumoratlas.org/) |
| 27 | +* on the IDC Portal Explore page, click the "i" icon next to the collection in the collections list
| 28 | + |
| 29 | +<figure><img src="../.gitbook/assets/image (52).png" alt=""><figcaption></figcaption></figure> |
| 30 | + |
| 31 | +* the `source_doi` metadata column contains the Digital Object Identifier (DOI) at the granularity of individual files, and is available via both the [Python `idc-index` package](https://github.com/ImagingDataCommons/idc-index) and the BigQuery interface
17 | 32 |
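As a minimal sketch of how the per-file `source_doi` values could be used, the snippet below groups file-level rows by collection and turns each distinct DOI into a resolvable `https://doi.org/` link. It deliberately avoids calling `idc-index` or BigQuery: the `rows` list is an illustrative stand-in for whatever either interface returns, and the `collection_id` key, the `nlm_visible_human_project` identifier, and the exact capitalization of the `source_doi` column are assumptions to check against the actual index.

```python
from collections import defaultdict

DOI_RESOLVER = "https://doi.org/"


def doi_to_url(doi: str) -> str:
    """Turn a bare DOI such as '10.5281/zenodo.12690049' into a resolvable URL."""
    return DOI_RESOLVER + doi.strip()


def dois_by_collection(rows):
    """Map each collection to the sorted list of distinct DOIs found in its files."""
    grouped = defaultdict(set)
    for row in rows:
        doi = row.get("source_doi")
        if doi:  # skip rows that carry no DOI
            grouped[row["collection_id"]].add(doi)
    return {cid: sorted(dois) for cid, dois in grouped.items()}


# Illustrative rows only (hypothetical stand-in for file-level metadata);
# the key names and the collection identifier are assumptions.
rows = [
    {"collection_id": "nlm_visible_human_project", "source_doi": "10.5281/zenodo.12690049"},
    {"collection_id": "nlm_visible_human_project", "source_doi": "10.5281/zenodo.12690049"},
]

for cid, dois in dois_by_collection(rows).items():
    for doi in dois:
        print(cid, doi_to_url(doi))
```

Because the DOI is recorded per file, many rows usually share the same value, which is why the sketch deduplicates before building links.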
|
18 | 33 | {% hint style="info" %} |
19 | | -If you would like us to prioritize an existing public collection, which is not currently included in the IDC offering, please start the discussion on our [forum](https://discourse.canceridc.dev/c/data/8)! |
| 34 | +Whenever source data is harmonized into DICOM, the DOI corresponds to a Zenodo entry describing the harmonized result, which in turn references the location where the data can be accessed in its native format (if available). For example, the IDC NLM-Visible-Human-Project collection points to [https://doi.org/10.5281/zenodo.12690049](https://doi.org/10.5281/zenodo.12690049), a DOI describing the dataset produced by harmonizing the original data into DICOM. That entry in turn references the [NLM Visible Human project page](https://www.nlm.nih.gov/research/visible/visible_human.html), which explains how to access the original files collected by the project.
20 | 35 | {% endhint %} |
21 | 36 |
|
22 | 37 | Check out [Data release notes](data-release-notes.md) for information about the collections added in the individual IDC data releases. |
23 | 38 |
|
24 | | -In the following pages we discuss how to access datasets hosted by IDC and their organization. |
| 39 | +## Data ingestion process |
| 40 | + |
| 41 | +A simplified workflow for IDC data ingestion is summarized in the following diagram.
| 42 | + |
| 43 | +{% embed url="https://docs.google.com/presentation/d/1UVpNVyVy3xIYLDnm4rtgAUmSu-uKQo5krekI9DSMT8o/edit?slide=id.g2fbbb94d529_0_76#slide=id.g2fbbb94d529_0_76" %} |
| 44 | +IDC data ingestion workflow |
| 45 | +{% endembed %} |
| 46 | + |