Metadata Standards
Metadata is data about data. It makes data both searchable and interpretable.
Metadata can be divided into three broad categories:
- clinical metadata;
- biospecimen metadata; and
- assay file metadata.
Metadata Conceptual Diagram
Although the following diagram (Figure 1) is an oversimplification, it can help conceptualize how the HTAN metadata is structured.
- Each headered rectangle in the diagram is a separate table or "manifest" which contains a set of attributes.
- Attributes in the tables include identifiers such as the HTAN_PARENT_ID which help connect the data together. Please see the Relationship Model Page in this manual for more specific information about connecting data together.
Figure 1. Metadata can be conceptually thought of as a series of relational database tables
Clinical Metadata is organized into Tiers.
Clinical metadata is organized into tiers. The structure of these tiers differs in Phase 1 and Phase 2 of HTAN. In Phase 1 there were three clinical data tiers. In Phase 2, there are two. For both phases, Tier 1 represents clinical data which is generally common to all studies and Atlases. Higher tiers are extensions to Tier 1, some of which are cancer or study-specific.
Phase 2 Clinical Metadata Tiers
In HTAN Phase 2, there are only two clinical metadata tiers.
- Tier 1 metadata is divided into multiple categories, including Demographics, Diagnosis, Family History, Exposure, Molecular Test, Therapy, Follow Up and Vital Status.
- Tier 2 contains any cancer or study-specific clinical information which is not represented in Tier 1. Tier 2 is a flexible comma-separated value (csv) file. The only required attribute is HTAN Participant ID. All other attributes (column headers) are determined by the submitting Center.
Figure 2 provides a general representation of the model.
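The Tier 2 rule described above (only HTAN Participant ID is required; every other column is chosen by the submitting Center) can be illustrated with a small check. This is a minimal sketch, not the DCC's validation code: the function name check_tier2_header is hypothetical, the exact spelling of the column header in real files is an assumption, and the sample data is invented.

```python
import csv
import io


def check_tier2_header(csv_text, required="HTAN Participant ID"):
    """Return True if the CSV header row contains the required column.

    All other columns are left unchecked because they are chosen by the
    submitting Center. The header spelling used here is an assumption.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])  # empty input yields an empty header
    return required in [name.strip() for name in header]


# Hypothetical Tier 2 content; the second column name and all values
# are invented for illustration.
example = "HTAN Participant ID,Smoking Pack Years\nP1,12\n"
print(check_tier2_header(example))
```

A file without the participant ID column, or an empty file, would fail this check regardless of what other columns it carries.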
Phase 1 Clinical Metadata Tiers
In HTAN Phase 1, Tier 1 clinical metadata was based on the NCI's Genomic Data Commons (GDC) clinical data model. Phase 1 clinical data was divided into three tiers:
- Similar to HTAN Phase 2, Tier 1 metadata was divided into multiple categories, including Demographics, Diagnosis, Family History, Exposure, Molecular Test, Therapy, Follow Up and Vital Status Update. However, the attributes, valid values and requirements differ between Phase 1 and Phase 2 for each of these Tier 1 clinical metadata categories.
- Phase 1 Tiers 2 and 3 are disease-agnostic (Tier 2) and disease-specific (Tier 3) extensions to the GDC model.
These tiers are shown in Figure 3 and are described in more detail on the Phase 1 Clinical Data Page.
Original vs Derived Biospecimen
Biospecimen metadata covers the original biopsy or surgical specimen as well as any derived specimens (e.g. a tissue section or slide) which were subsequently used for an assay.
- A derived specimen connects to its originating specimen via HTAN_PARENT_ID. (The derived specimen's HTAN_PARENT_ID is the HTAN_BIOSPECIMEN_ID of the originating specimen.)
- An originating specimen's HTAN_PARENT_ID would be the HTAN_PARTICIPANT_ID.
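The parent links described above can be followed programmatically to find the participant behind any specimen. The following is a minimal sketch, assuming biospecimen rows have been loaded as dictionaries keyed by the attribute names used above. The helper name find_participant and the sample identifiers are hypothetical and do not follow real HTAN ID formats.

```python
# Hypothetical biospecimen rows; the ID values are made up for illustration.
biospecimens = [
    {"HTAN_BIOSPECIMEN_ID": "B1", "HTAN_PARENT_ID": "P1"},  # originating specimen
    {"HTAN_BIOSPECIMEN_ID": "B2", "HTAN_PARENT_ID": "B1"},  # derived from B1
    {"HTAN_BIOSPECIMEN_ID": "B3", "HTAN_PARENT_ID": "B2"},  # derived from B2
]


def find_participant(biospecimen_id, rows):
    """Follow HTAN_PARENT_ID links until the parent is not a biospecimen.

    Assumes that a parent ID which is not itself a biospecimen ID is a
    participant ID, as described for originating specimens.
    """
    parent_of = {r["HTAN_BIOSPECIMEN_ID"]: r["HTAN_PARENT_ID"] for r in rows}
    if biospecimen_id not in parent_of:
        raise KeyError(f"unknown biospecimen: {biospecimen_id}")
    seen = set()
    current = biospecimen_id
    while current in parent_of:
        if current in seen:  # guard against malformed, circular links
            raise ValueError(f"cycle in parent links at {current}")
        seen.add(current)
        current = parent_of[current]
    return current


print(find_participant("B3", biospecimens))
```

Building the lookup dictionary once keeps each hop constant-time, and the cycle guard prevents an endless loop if the parent links are malformed.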
Assay File Metadata
HTAN divides assay data files into levels which increase from level 1 (raw data) to level 4 (derived cohort-level data). Please see the File Standards Page for more information about assay data levels.
Assay file metadata corresponds to each assay file level. For example, Whole Exome Sequencing (WES) data currently has 3 file levels. As a result, 3 levels of WES assay metadata are collected, one for each file level. These metadata are referred to as "WES - Level 1", "WES - Level 2" and "WES - Level 3" in the HTAN Phase 2 Data Model, and as "Bulk DNA Level 1", "Bulk DNA Level 2" and "Bulk DNA Level 3" in the HTAN Phase 1 Model.
The HTAN DCC maintains a set of code in GitHub repositories (one for each Phase of HTAN) to document and validate metadata. Specific information about which attributes are collected, which are required, and what their valid values are is also provided to help data contributors. These resources may also be helpful to data users.
Please click on the panels below to download files which describe Phase 1 Metadata Attributes.
KEY for All downloadable Phase 1 Metadata files.
Download Phase 1 Clinical Attributes
Phase 1 HTAN clinical data consists of three tiers. Tier 1 is based on the NCI Genomic Data Commons (GDC) clinical data model, while Tiers 2 and 3 are extensions to the GDC model.
Tier 1 Clinical Data
Tier 2 and 3 Clinical Data
Tier 2 consists of disease-agnostic extensions to the GDC clinical data model.
Tier 3 consists of disease-specific extensions to the GDC clinical data model. This covers additional elements for Acute Lymphoblastic Leukemia (ALL), Brain Cancer, Breast Cancer, Lung Cancer, Melanoma, Ovarian Cancer, Pancreatic Cancer, Prostate Cancer and Sarcoma.
Download Phase 1 Biospecimen Attributes
The HTAN biospecimen data model is designed to capture essential biospecimen data elements, including:
- Acquisition method, e.g. autopsy, biopsy, fine needle aspirate, etc.
- Topography Code, indicating site within the body, e.g. based on ICD-O-3.
- Collection information, e.g. time, duration of ischemia, temperature, etc.
- Processing of parent biospecimen information, e.g. fresh, frozen, etc.
- Biospecimen and derivative clinical metadata, i.e. Histologic Morphology Code, e.g. based on ICD-O-3.
- Coordinates for derivative biospecimen from their parent biospecimen.
- Processing of derivative biospecimen for downstream analysis, e.g. dissociation, sectioning, analyte isolation, etc.
Phase 1 HTAN biospecimen metadata leveraged existing common data elements from four sources:
- Genomic Data Commons (GDC)
- Consortium for Molecular and Cellular Characterization of Screen-Detected Lesions (MCL)
- Human Cell Atlas (HCA)
- NCI standards described in the caDSR system
Download Phase 1 Imaging Attributes
The HTAN data model for imaging data is based upon the Minimum Information about Tissue Imaging (MITI) reporting guidelines. These comprise minimal metadata for highly multiplexed tissue images and were developed in consultation with methods developers, experts in imaging metadata (e.g., DICOM and OME) and multiple large-scale atlasing projects; they are guided by existing standards and accommodate most multiplexed imaging technologies and both centralized and distributed data storage.
For further information on the MITI guidelines, please see the MITI website, the specification on GitHub, and the Nature Methods publication.
The HTAN data model for imaging was intended primarily for multiplexed imaging, such as CODEX, CyCIF, and IMC, in addition to brightfield imaging of H&E stained tissues.
Download Phase 1 Mass Spectrometry Attributes
The Phase 1 HTAN Mass Spectrometry Standard was developed with a focus on Proteomics data.
Download Phase 1 Sequencing Attributes
In alignment with The Cancer Genome Atlas and the NCI Genomic Data Commons, sequencing data are divided into four levels:
Download Phase 1 Spatial Transcriptomics Attributes
Support for several spatial sequencing modalities was added near the end of HTAN Phase 1. Spatial transcriptomics assays had platform-specific data levels which deviated from the traditional HTAN file level schema. Specifically,
- some platforms have only one experiment level instead of multiple file levels (10X Xenium ISS and Nanostring CosMx SMI); and
- some platforms have additional files (auxiliary or annotation metadata) which were not a part of the typical 4 level system (e.g. 10X Visium, Nanostring GeoMx DSP).