
# Identifiers

All research participants, biospecimens and derived data within HTAN are associated with unique HTAN identifiers. The relationship between the identifiers is visually represented in Figure 1.

Figure 1. Phase 2 HTAN ID Provenance

# ID to ID linkages

Note that the explicit linking of participants to biospecimens to data files is not encoded in the HTAN Identifier. Rather, the linking is encoded in explicit metadata elements (see Relationship Model).

# Phase 2 vs Phase 1 HTAN Identifiers

Phase 2 of HTAN introduced small improvements to the HTAN Identifier system. Figure 1 shows some elements of the new system, most notably the inclusion of a “B” or a “D” in HTAN identifiers to distinguish biospecimens from data files. See Figure 4 in Phase 1 HTAN ID Provenance for comparison. The regex patterns used to validate Phase 2 identifiers are listed in the Phase 2 Regex Validation section.

# Phase 2 HTAN IDs

# Phase 2 Participant IDs

Research participants are identified with the following pattern:

<participant_id> ::= <htan_center_id>_integer

where htan_center_id is the HTAN Center prefix (e.g. HTA200, HTA201). See HTAN Centers for a full list of HTAN Center prefixes.

# Phase 2 Biospecimen and Data File IDs

Biospecimens such as samples, tissue blocks, slides, aliquots and analytes obtained from a research participant have identifiers which follow the pattern:

<biospecimen_entity_id> ::= <participant_id>_Binteger

where the "B" before the integer denotes "Biospecimen".

For example, if research participant 1 within the Yale Lymphoma atlas (HTA209) provided three samples, you would have three biospecimen HTAN IDs:

HTA209_1_B1
HTA209_1_B3
HTA209_1_B8

Data files that result from those biospecimens have identifiers which follow the pattern:

<datafile_entity_id> ::= <participant_id>_Dinteger

where the "D" before the integer denotes "Data File".

For example, if an assay was performed on a biospecimen from the same Yale Lymphoma atlas (HTA209) participant, the data files would have HTAN IDs such as:

HTA209_1_D12
HTA209_1_D15
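
The participant, biospecimen and data file patterns above can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, they are not part of any HTAN tooling, and they perform no validation.

```python
def participant_id(center_id, participant):
    """Build <htan_center_id>_integer, e.g. HTA209_1."""
    return f"{center_id}_{participant}"


def biospecimen_id(participant, number):
    """Append _B<integer> to a participant ID ("B" denotes Biospecimen)."""
    return f"{participant}_B{number}"


def data_file_id(participant, number):
    """Append _D<integer> to a participant ID ("D" denotes Data File)."""
    return f"{participant}_D{number}"


# Participant 1 of the Yale Lymphoma atlas (HTA209), as in the examples above.
p = participant_id("HTA209", 1)
print(biospecimen_id(p, 3))   # HTA209_1_B3
print(data_file_id(p, 12))    # HTA209_1_D12
```

Note that the data file ID is built from the participant ID, not from a biospecimen ID: the link between a file and its biospecimen lives in the metadata, not in the identifier.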

# Phase 2 Special Identifiers

# Pooled samples and pooled files

If a biospecimen or data file is derived from more than one research participant, the biospecimen or data file identifier should use '0000' after the HTAN Center Prefix.

Figure 2 demonstrates use of '0000' for a pooled data file.

Figure 2. Phase 2 Pooled Data File Example

Figure 3 demonstrates use of '0000' for pooled biospecimen and data files.

Figure 3. Phase 2 Pooled Biospecimen and Data File Example

# Control/Blank samples

HTAN identifiers that contain 'EXT' indicate that the biospecimen or data file was derived from either an external control participant or a blank control.

Examples:

HTA209_EXT1_B1
HTA209_EXT2_D34
HTA209_EXT3_D590
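
As a rough sketch of how the special identifiers might be recognized in code, the following Python function inspects the segment after the HTAN Center prefix. The function name and the returned labels are assumptions made for this example, and the pooled ID in the usage lines is made up for illustration.

```python
def participant_segment_kind(htan_id):
    """Classify the segment that follows the HTAN Center prefix.

    Returns "pooled" for '0000', "control" for an 'EXT' segment
    (external control participant or blank control), and
    "participant" otherwise. Labels are hypothetical.
    """
    parts = htan_id.split("_")
    if len(parts) < 2 or not parts[1]:
        raise ValueError(f"no participant segment in {htan_id!r}")
    segment = parts[1]
    if segment == "0000":
        return "pooled"
    if segment.startswith("EXT"):
        return "control"
    return "participant"


print(participant_segment_kind("HTA209_0000_D7"))   # pooled
print(participant_segment_kind("HTA209_EXT1_B1"))   # control
print(participant_segment_kind("HTA209_1_B1"))      # participant
```

This function only classifies the segment; it does not check that the identifier as a whole is well formed.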

# Phase 2 Regex Validation

These regular expressions validate HTAN identifiers by enforcing a specific prefix range (HTA200–HTA229), a middle identifier (numeric or EXT-based), and specific suffix rules for data files and biospecimens.

# HTAN Data File ID

Regex: ^(?=.{1,50}$)(HTA2[0-2][0-9])_(0000|EXT[0-9]{1,18}|[0-9]{1,21})_(D[0-9]{1,20})$

Examples:

  • HTA201_12345_D1 (Valid)
  • HTA201_12345_B1 (Invalid: ends with _B instead of _D)
  • HTA250_12345_D1 (Invalid: prefix HTA250 is out of valid range)

# HTAN Participant ID

Regex: ^(?=.{1,50}$)(HTA2[0-2][0-9])_(0000|EXT[0-9]{1,18}|[0-9]{1,21})$

Examples:

  • HTA210_EXT999 (Valid)
  • HTA210_EXT999_B1 (Invalid: contains a suffix, which is not allowed)
  • HTA199_EXT999 (Invalid: prefix HTA199 is out of valid range)

# HTAN Biospecimen ID

Regex: ^(?=.{1,50}$)(HTA2[0-2][0-9])_(0000|EXT[0-9]{1,18}|[0-9]{1,21})_(B[0-9]{1,20})$

Examples:

  • HTA220_55555_B2 (Valid)
  • HTA220_55555_D2 (Invalid: ends with _D instead of _B)
  • HTA220_55555 (Invalid: missing the mandatory _B suffix)

# HTAN Parent ID (from biospecimen)

Matches a Participant ID OR a Biospecimen ID.

Regex: ^(?=.{1,50}$)(HTA2[0-2][0-9])_(0000|EXT[0-9]{1,18}|[0-9]{1,21})(?:_(B[0-9]{1,20}))?$

Examples:

  • HTA205_1001_B5 (Valid)
  • HTA205_1001 (Valid: Participant ID used as parent)
  • HTA205_1001_D5 (Invalid: contains _D suffix; only no suffix or _B allowed)
  • HTA205_ (Invalid: missing the middle ID number section)

# HTAN Parent ID (from core)

Matches a Biospecimen ID OR a Data File ID.

Regex: ^(?=.{1,50}$)(HTA2[0-2][0-9])_(0000|EXT[0-9]{1,18}|[0-9]{1,21})_([BD][0-9]{1,20})$

Examples:

  • HTA215_777_D1 (Valid)
  • HTA215_777 (Invalid: missing mandatory suffix; must be _B or _D)
  • HTA215_777_A1 (Invalid: suffix _A is invalid)
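
The five patterns above can be exercised with a short Python sketch. The regexes are copied verbatim from this section. The dictionary keys and the `is_valid` helper are hypothetical names chosen for this example, not part of any HTAN validator.

```python
import re

# Patterns copied verbatim from the sections above; keys are illustrative labels.
PATTERNS = {
    "data_file": r"^(?=.{1,50}$)(HTA2[0-2][0-9])_(0000|EXT[0-9]{1,18}|[0-9]{1,21})_(D[0-9]{1,20})$",
    "participant": r"^(?=.{1,50}$)(HTA2[0-2][0-9])_(0000|EXT[0-9]{1,18}|[0-9]{1,21})$",
    "biospecimen": r"^(?=.{1,50}$)(HTA2[0-2][0-9])_(0000|EXT[0-9]{1,18}|[0-9]{1,21})_(B[0-9]{1,20})$",
    "parent_from_biospecimen": r"^(?=.{1,50}$)(HTA2[0-2][0-9])_(0000|EXT[0-9]{1,18}|[0-9]{1,21})(?:_(B[0-9]{1,20}))?$",
    "parent_from_core": r"^(?=.{1,50}$)(HTA2[0-2][0-9])_(0000|EXT[0-9]{1,18}|[0-9]{1,21})_([BD][0-9]{1,20})$",
}
COMPILED = {name: re.compile(pattern) for name, pattern in PATTERNS.items()}


def is_valid(kind, htan_id):
    """Return True if htan_id matches the pattern registered under kind."""
    # fullmatch is used because "$" alone in Python also matches just
    # before a trailing newline, which would let "ID\n" through.
    return COMPILED[kind].fullmatch(htan_id) is not None


print(is_valid("data_file", "HTA201_12345_D1"))            # True
print(is_valid("data_file", "HTA250_12345_D1"))            # False
print(is_valid("parent_from_biospecimen", "HTA205_1001"))  # True
```

Other regex engines may treat `$` and lookaheads slightly differently, so the same check should be repeated if the patterns are used outside Python.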

# Regex Structure Explanation

The following breakdown uses the HTAN Parent ID (from core) as an example, as it contains all component parts used across the identifiers.

Pattern: ^(?=.{1,50}$)(HTA2[0-2][0-9])_(0000|EXT[0-9]{1,18}|[0-9]{1,21})_([BD][0-9]{1,20})$

  1. ^(?=.{1,50}$)

    • Start of string (^): Ensures the match starts at the very beginning.
    • Lookahead ((?=...)): Checks that the total length of the string is between 1 and 50 characters before proceeding with the specific matching.
  2. (HTA2[0-2][0-9])

    • Center ID: Matches the literal HTA followed by a number from 200 to 229 inclusive (2 followed by 0-2, followed by 0-9).
  3. _

    • Separator: Literal underscore character separating the Center ID from the Participant ID.
  4. (0000|EXT[0-9]{1,18}|[0-9]{1,21})

    • Participant ID: Matches one of three valid formats:
      • 0000 (Pooled ID, used when the entity is derived from more than one participant)
      • EXT followed by 1 to 18 digits (External or blank control ID)
      • 1 to 21 digits (Standard numeric ID)
  5. _

    • Separator: Literal underscore character.
  6. ([BD][0-9]{1,20})$

    • Suffix & End: Matches either B (Biospecimen) or D (Data File), followed by 1 to 20 digits.
    • End of string ($): Ensures there are no extra characters after the ID.
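
The breakdown above can be made concrete with named capture groups. The sketch below is a variant of the documented pattern, not the pattern itself: the group names are assumptions for this example, and the suffix group is split into a letter and a number so that each part can be read separately.

```python
import re

# Variant of the "Parent ID (from core)" pattern with named groups.
# The group names and the split of the suffix are illustrative changes.
HTAN_ID = re.compile(
    r"^(?=.{1,50}$)"                                       # 1. start and length check
    r"(?P<center>HTA2[0-2][0-9])"                          # 2. center ID
    r"_"                                                   # 3. separator
    r"(?P<participant>0000|EXT[0-9]{1,18}|[0-9]{1,21})"    # 4. participant ID
    r"_"                                                   # 5. separator
    r"(?P<entity>[BD])(?P<number>[0-9]{1,20})$"            # 6. suffix and end
)


def parse(htan_id):
    """Return the named components as a dict, or None if there is no match."""
    match = HTAN_ID.fullmatch(htan_id)
    return match.groupdict() if match else None


print(parse("HTA215_777_D1"))
# {'center': 'HTA215', 'participant': '777', 'entity': 'D', 'number': '1'}
print(parse("HTA215_777_A1"))
# None
```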

# Phase 1 HTAN IDs

# Phase 1 Participant IDs

Figure 4. Phase 1 HTAN ID Provenance

Research participants are identified with the following pattern:

<participant_id> ::= <htan_center_id>_integer

where htan_center_id is the HTAN Center prefix (e.g. HTA1, HTA2). See HTAN Centers for a full list of HTAN Center prefixes.

# Phase 1 Biospecimen and Data File IDs

Derivative data includes anything derived from a research participant, including biospecimens such as samples, tissue blocks, slides, aliquots, analytes, and data files that result from assaying those biospecimens. These identifiers follow the pattern:

<derivative_entity_id>	::= <participant_id>_integer

For example, if research participant 1 within the CHOP project (HTA4) has provided three samples, you would have three HTAN IDs, such as:

HTA4_1_1
HTA4_1_3
HTA4_1_8

If a single data file is generated from one of those samples, that file could have an HTAN ID such as:

HTA4_1_42

# Phase 1 Special Identifiers

If a single data file is derived from more than one participant, the file identifier may contain a wildcard string, e.g. ‘0000’, after the HTAN center identifier. For example:

HTA4_0000_1
HTA4_0000_2
HTA4_0000_3

If a biospecimen or data file is derived from an external control participant, its identifier will contain the string ‘EXT’ before the external control participant integer. For example:

HTA4_EXT1_1
HTA4_EXT2_2
HTA4_EXT3_3
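
Phase 1 derivative identifiers end in a bare integer, while Phase 2 identifiers carry a "B" or "D" before it, so the final segment is enough to tell the two schemes apart. The Python sketch below illustrates this under stated assumptions: the function name and return labels are hypothetical, and it looks only at the final segment without validating the center prefix or participant segment.

```python
import re


def entity_segment_info(htan_id):
    """Infer (phase, entity type) from the final segment of a three-part ID.

    Hypothetical helper for illustration; it does not validate the
    center prefix or the participant segment.
    """
    parts = htan_id.split("_")
    if len(parts) != 3:
        raise ValueError(f"expected three '_'-separated segments: {htan_id!r}")
    entity = parts[2]
    if re.fullmatch(r"[0-9]+", entity):
        # Phase 1: biospecimens and data files share one pattern,
        # so the entity type cannot be read from the ID.
        return ("phase1", "derivative")
    if re.fullmatch(r"B[0-9]+", entity):
        return ("phase2", "biospecimen")
    if re.fullmatch(r"D[0-9]+", entity):
        return ("phase2", "data_file")
    raise ValueError(f"unrecognized entity segment: {entity!r}")


print(entity_segment_info("HTA4_1_42"))      # ('phase1', 'derivative')
print(entity_segment_info("HTA209_1_B3"))    # ('phase2', 'biospecimen')
print(entity_segment_info("HTA209_1_D12"))   # ('phase2', 'data_file')
```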