#
Identifiers
All research participants, biospecimens and derived data within HTAN are associated with unique HTAN identifiers. The relationship between the identifiers is visually represented in Figure 1.
If you will be creating HTAN identifiers for an HTAN Center or Trans Network Project (TNP), please see the step-by-step directions in the Creating Identifiers section of this manual.
#
ID to ID linkages
Note that the explicit linking of participants to biospecimens to data files is not encoded in the HTAN Identifier. Rather, the linking is encoded in explicit metadata elements (see Relationship Model).
#
Phase 2 vs Phase 1 HTAN Identifiers
Small improvements to the HTAN Identifier system were introduced in Phase 2 of HTAN. Figure 1 shows some elements of the new identifier system, namely the inclusion of a “B” or a “D” in HTAN identifiers to distinguish biospecimens from data files. Please see
Figure 4 in
#
Phase 2 HTAN IDs
#
Phase 2 Participant IDs
Research participants are identified with the following pattern:
<participant_id> ::= <htan_center_id>_integer
where htan_center_id is the HTAN Center prefix (e.g. HTA200, HTA201). Please see HTAN Centers for a full list of HTAN Center prefixes.
#
Phase 2 Biospecimen and Data File IDs
Biospecimens such as samples, tissue blocks, slides, aliquots and analytes obtained from a research participant have identifiers which follow the pattern:
<biospecimen_entity_id> ::= <participant_id>_Binteger
where the "B" before the integer denotes "Biospecimen".
For example, if research participant 1 within the Yale Lymphoma atlas (HTA209) provided three samples, you would have three biospecimen HTAN IDs:
HTA209_1_B1
HTA209_1_B3
HTA209_1_B8
Data files that result from those biospecimens have identifiers which follow the pattern:
<datafile_entity_id> ::= <participant_id>_Dinteger
where the "D" before the integer denotes "Data File".
For example, if an assay was performed on a biospecimen from the same Yale Lymphoma atlas (HTA209) participant, the data files would have HTAN IDs such as:
HTA209_1_D12
HTA209_1_D15
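The three Phase 2 patterns above can be sketched as small string-building helpers. This is only an illustration: the function names are hypothetical and are not part of any HTAN tooling, and the helpers do not check that a center prefix is valid.

```python
def participant_id(center_id: str, participant: int) -> str:
    """<participant_id> ::= <htan_center_id>_integer"""
    return f"{center_id}_{participant}"


def biospecimen_id(center_id: str, participant: int, index: int) -> str:
    """<biospecimen_entity_id> ::= <participant_id>_Binteger"""
    return f"{participant_id(center_id, participant)}_B{index}"


def datafile_id(center_id: str, participant: int, index: int) -> str:
    """<datafile_entity_id> ::= <participant_id>_Dinteger"""
    return f"{participant_id(center_id, participant)}_D{index}"


# Reproduce the Yale Lymphoma atlas (HTA209) examples from the text.
print([biospecimen_id("HTA209", 1, n) for n in (1, 3, 8)])
print([datafile_id("HTA209", 1, n) for n in (12, 15)])
```

The integers after "B" and "D" do not need to be consecutive, as the examples show; they only need to be unique within the participant.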
#
Phase 2 Special Identifiers
#
Pooled samples and pooled files
If a biospecimen or data file is derived from more than one research participant, the biospecimen or data file identifier should use '0000' after the HTAN Center Prefix.
Figure 2 demonstrates use of '0000' for a pooled data file.
Figure 3 demonstrates use of '0000' for pooled biospecimen and data files.
#
Control/Blank samples
HTAN identifiers which contain 'EXT' indicate that the biospecimen or data file was either derived from an external control participant or a blank control.
Examples:
HTA209_EXT1_B1
HTA209_EXT2_D34
HTA209_EXT3_D590
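As a rough sketch of how the special identifiers can be told apart, the helper below inspects the second underscore-separated field of an identifier. The function name and the returned labels are hypothetical, and the pooled identifier used in the usage line is an illustrative value rather than a real HTAN ID.

```python
def participant_kind(htan_id: str) -> str:
    """Classify an HTAN ID by its participant field (the second field)."""
    fields = htan_id.split("_")
    if len(fields) < 2 or not fields[1]:
        raise ValueError(f"no participant field in {htan_id!r}")
    middle = fields[1]
    if middle == "0000":
        # Derived from more than one research participant.
        return "pooled"
    if middle.startswith("EXT"):
        # External control participant or blank control.
        return "control"
    return "participant"


print(participant_kind("HTA209_EXT1_B1"))   # control
print(participant_kind("HTA209_0000_D1"))   # pooled (illustrative ID)
print(participant_kind("HTA209_1_B3"))      # participant
```

This check assumes the identifier is otherwise well formed; it does not replace validation against the Phase 2 regular expressions.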
#
Phase 2 Regex Validation
These regular expressions validate HTAN identifiers by enforcing a specific prefix range (HTA200–HTA229), a middle identifier (numeric or EXT-based), and specific suffix rules for data files and biospecimens.
#
HTAN Data File ID
Regex:
^(?=.{1,50}$)(HTA2[0-2][0-9])_(0000|EXT[0-9]{1,18}|[0-9]{1,21})_(D[0-9]{1,20})$
Examples:
- ✅ HTA201_12345_D1
- ❌ HTA201_12345_B1 (Ends with _B instead of _D)
- ❌ HTA250_12345_D1 (Prefix HTA250 is out of valid range)
#
HTAN Participant ID
Regex:
^(?=.{1,50}$)(HTA2[0-2][0-9])_(0000|EXT[0-9]{1,18}|[0-9]{1,21})$
Examples:
- ✅ HTA210_EXT999
- ❌ HTA210_EXT999_B1 (Contains a suffix, which is not allowed)
- ❌ HTA199_EXT999 (Prefix HTA199 is out of valid range)
#
HTAN Biospecimen ID
Regex:
^(?=.{1,50}$)(HTA2[0-2][0-9])_(0000|EXT[0-9]{1,18}|[0-9]{1,21})_(B[0-9]{1,20})$
Examples:
- ✅ HTA220_55555_B2
- ❌ HTA220_55555_D2 (Ends with _D instead of _B)
- ❌ HTA220_55555 (Missing the mandatory _B suffix)
#
HTAN Parent ID (from biospecimen)
Matches a Participant ID OR a Biospecimen ID.
Regex:
^(?=.{1,50}$)(HTA2[0-2][0-9])_(0000|EXT[0-9]{1,18}|[0-9]{1,21})(?:_(B[0-9]{1,20}))?$
Examples:
- ✅ HTA205_1001_B5
- ✅ HTA205_1001 (Valid Participant ID used as parent)
- ❌ HTA205_1001_D5 (Contains _D suffix; only no suffix or _B allowed)
- ❌ HTA205_ (Missing the middle ID number section)
#
HTAN Parent ID (from core)
Matches a Biospecimen ID OR a Data File ID.
Regex:
^(?=.{1,50}$)(HTA2[0-2][0-9])_(0000|EXT[0-9]{1,18}|[0-9]{1,21})_([BD][0-9]{1,20})$
Examples:
- ✅ HTA215_777_D1
- ❌ HTA215_777 (Missing mandatory suffix; must be _B or _D)
- ❌ HTA215_777_A1 (Suffix _A is invalid)
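The five expressions above can be exercised with Python's standard re module. The sketch below copies the patterns verbatim; the dictionary keys and the function name are illustrative assumptions and do not come from HTAN tooling.

```python
import re

# Patterns copied from the sections above; the keys are illustrative names.
PATTERNS = {
    "participant": re.compile(
        r"^(?=.{1,50}$)(HTA2[0-2][0-9])_(0000|EXT[0-9]{1,18}|[0-9]{1,21})$"
    ),
    "biospecimen": re.compile(
        r"^(?=.{1,50}$)(HTA2[0-2][0-9])_(0000|EXT[0-9]{1,18}|[0-9]{1,21})_(B[0-9]{1,20})$"
    ),
    "data_file": re.compile(
        r"^(?=.{1,50}$)(HTA2[0-2][0-9])_(0000|EXT[0-9]{1,18}|[0-9]{1,21})_(D[0-9]{1,20})$"
    ),
    "parent_from_biospecimen": re.compile(
        r"^(?=.{1,50}$)(HTA2[0-2][0-9])_(0000|EXT[0-9]{1,18}|[0-9]{1,21})(?:_(B[0-9]{1,20}))?$"
    ),
    "parent_from_core": re.compile(
        r"^(?=.{1,50}$)(HTA2[0-2][0-9])_(0000|EXT[0-9]{1,18}|[0-9]{1,21})_([BD][0-9]{1,20})$"
    ),
}


def matching_types(htan_id: str) -> list[str]:
    """Return the name of every pattern that the identifier satisfies."""
    # fullmatch is used because Python's $ also matches just before a
    # trailing newline, which would otherwise let "ID\n" through.
    return [name for name, pattern in PATTERNS.items() if pattern.fullmatch(htan_id)]


for candidate in ("HTA201_12345_D1", "HTA205_1001", "HTA220_55555_B2", "HTA250_12345_D1"):
    print(candidate, matching_types(candidate))
```

An identifier can satisfy more than one pattern, because the two Parent ID expressions are unions of the others: a valid Biospecimen ID, for instance, also matches both Parent ID patterns.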
#
Regex Structure Explanation
The following breakdown uses the HTAN Parent ID (from core) as an example, as it contains all component parts used across the identifiers.
Pattern: ^(?=.{1,50}$)(HTA2[0-2][0-9])_(0000|EXT[0-9]{1,18}|[0-9]{1,21})_([BD][0-9]{1,20})$
- ^(?=.{1,50}$)
  - Start of string (^): Ensures the match starts at the very beginning.
  - Lookahead ((?=...)): Checks that the total length of the string is between 1 and 50 characters before proceeding with the specific matching.
- (HTA2[0-2][0-9])
  - Center ID: Matches the literal HTA followed by a number from 200 to 229 inclusive (2 followed by 0-2, followed by 0-9).
- _
  - Separator: Literal underscore character separating the Center ID from the Participant ID.
- (0000|EXT[0-9]{1,18}|[0-9]{1,21})
  - Participant ID: Matches one of three valid formats:
    - 0000 (Pooled ID, used when the entity is derived from more than one participant)
    - EXT followed by 1 to 18 digits (External ID)
    - 1 to 21 digits (Standard numeric ID)
- _
  - Separator: Literal underscore character.
- ([BD][0-9]{1,20})$
  - Suffix & End: Matches either B (Biospecimen) or D (Data File), followed by 1 to 20 digits.
  - End of string ($): Ensures there are no extra characters after the ID.
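To make the component breakdown concrete, the following sketch splits a Phase 2 identifier into its parts using named groups. The combined pattern, with its group names and optional suffix, is an assumption made for illustration; it is not one of the official expressions, although it accepts the same strings as the Participant ID and Parent ID (from core) patterns together.

```python
import re
from typing import Optional

# Illustrative variant of the documented patterns with named groups.
HTAN_ID = re.compile(
    r"^(?=.{1,50}$)"                                      # overall length 1 to 50
    r"(?P<center>HTA2[0-2][0-9])"                         # center prefix HTA200 to HTA229
    r"_(?P<participant>0000|EXT[0-9]{1,18}|[0-9]{1,21})"  # pooled, external, or numeric
    r"(?:_(?P<kind>[BD])(?P<number>[0-9]{1,20}))?$"       # optional B or D suffix
)


def split_htan_id(htan_id: str) -> Optional[dict]:
    """Return the named components of a Phase 2 ID, or None if it is invalid."""
    match = HTAN_ID.fullmatch(htan_id)
    return match.groupdict() if match else None


print(split_htan_id("HTA215_777_D1"))
print(split_htan_id("HTA215_777"))
print(split_htan_id("HTA215_777_A1"))
```

For a Participant ID the kind and number entries are None, since the suffix group does not take part in the match.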
#
Phase 1 HTAN IDs
#
Phase 1 Participant IDs
Research participants are identified with the following pattern:
<participant_id> ::= <htan_center_id>_integer
where htan_center_id is the HTAN Center prefix (e.g. HTA1, HTA2). Please see HTAN Centers for a full list of HTAN Center prefixes.
#
Phase 1 Biospecimen and Data File IDs
Derivative data includes anything derived from a research participant, including biospecimens such as samples, tissue blocks, slides, aliquots, analytes, and data files that result from assaying those biospecimens. These identifiers follow the pattern:
<derivative_entity_id> ::= <participant_id>_integer
For example, if research participant 1 within the CHOP project (HTA4) has provided three samples, you would have three HTAN IDs, such as:
HTA4_1_1
HTA4_1_3
HTA4_1_8
#
Phase 1 Special Identifiers
If a single data file is generated from one of those samples, that file could have an HTAN ID such as:
HTA4_1_42
If a single data file is derived from more than one participant, the file identifier may contain a wildcard string, e.g. ‘0000’, after the HTAN center identifier. For example:
HTA4_0000_1
HTA4_0000_2
HTA4_0000_3
If a biospecimen or data file is derived from an external control participant, its identifier will contain the string ‘EXT’ before the external control participant integer. For example:
HTA4_EXT1_1
HTA4_EXT2_2
HTA4_EXT3_3
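No Phase 1 regular expression is given in this section, so the sketch below uses a hypothetical pattern inferred only from the Phase 1 examples above. It illustrates the practical difference from Phase 2: the final field is a bare integer, so the identifier alone does not say whether it names a biospecimen or a data file.

```python
import re
from typing import Optional

# Hypothetical pattern inferred from the Phase 1 examples; not an official regex.
PHASE1_DERIVATIVE = re.compile(
    r"(?P<center>HTA[0-9]+)"
    r"_(?P<participant>0000|EXT[0-9]+|[0-9]+)"
    r"_(?P<entity>[0-9]+)"
)


def parse_phase1(htan_id: str) -> Optional[dict]:
    """Split a Phase 1 derivative ID into its fields, or return None."""
    match = PHASE1_DERIVATIVE.fullmatch(htan_id)
    return match.groupdict() if match else None


print(parse_phase1("HTA4_1_42"))
print(parse_phase1("HTA4_0000_1"))
print(parse_phase1("HTA209_1_B1"))  # Phase 2 style suffix, so no match
```

Because the pattern does not restrict the center number, it should be treated as a format check only and not as proof that an identifier belongs to a Phase 1 center.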