
# Phase 2 sc/snRNA-seq Data Submission

# Overview

In HTAN Phase 2, the following files are submitted for single-cell/single-nucleus RNA-sequencing (sc/snRNA-seq) data:

| Level | Data Type | Example Files |
| --- | --- | --- |
| 1 | raw sequence data | fastq, unaligned bam |
| 2 | aligned sequence data | bam |
| 3_4 | sample level summary information, e.g. cell annotations, t-SNE/UMAP coordinates, etc. | h5ad |

Metadata requirements are documented in the HTAN Data Model readthedocs pages. This part of the manual describes file requirements for level 3_4 h5ad files.

The HTAN h5ad (AnnData 0.10) requirements are modeled after CELLxGENE's requirements. They also include three attributes developed by the Human Cell Atlas (HCA). Please see the Background section below for more information about h5ad (AnnData 0.10) files, CELLxGENE and the HCA.

# Required File Attributes

Similar to CELLxGENE's Dataset Requirements, level 3_4 sc/snRNA-seq h5ad files must contain the following attributes. Please see rHTAN_h5ad_exemplar_2025_03_03.h5ad for an example file which meets these requirements.

| Attribute | Ontology/Version | Comments/Examples |
| --- | --- | --- |
| var.index, raw.var.index | GENCODE. Please see CELLxGENE's latest schema for version information. | Ensembl IDs. For example: ENSG00000107566. |
| var.gene_is_filtered, raw.var.gene_is_filtered | | No genes are filtered in raw data. If a gene is filtered in normalized data, its count is set to 0 and gene_is_filtered is set to 1. |
| obs.organism_ontology_term_id | NCBITaxon | Set to NCBITaxon:9606 for human. |
| obs.donor_id | | Set to the HTAN Participant ID, e.g. HTA201_1. |
| obs.sample_id | | Set to the HTAN Biospecimen ID, e.g. HTA201_1_B. |
| obs.development_stage_ontology_term_id | Human Development Stages (HsapDv) | Use HCA recommended terms (p.22). |
| obs.sex_ontology_term_id | Phenotype and Trait Ontology (PATO) | Use CELLxGENE Requirements: PATO:0000384 for male, PATO:0000383 for female, or unknown if unavailable. |
| obs.self_reported_ethnicity_term_id | Human Ancestry Ontology (HANCESTRO) | Use CELLxGENE Requirements. Multiple comma-separated HANCESTRO terms may be used if more than one ethnicity is reported. If information is unavailable, use unknown. Example: HANCESTRO_0568. Note that CELLxGENE specifically excludes certain HANCESTRO categories; see the CELLxGENE Requirements for full details. |
| obs.disease_ontology_term_id | Mondo Disease Ontology | |
| obs.tissue_type | CELLxGENE | Use CELLxGENE Requirements. Permitted values are restricted to: tissue, organoid, or cell culture. |
| obs.tissue_ontology_term_id | Uber Anatomy Ontology (UBERON) | |
| obs.cell_type_ontology_term_id | Cell Ontology (CL) | |
| obs.assay_ontology_term_id | Experimental Factor Ontology (EFO) | Use CELLxGENE Requirements. |
| obs.suspension_type | CELLxGENE | Use CELLxGENE Requirements. Permitted values are restricted to: cell, nucleus, na. |
| obs.is_primary_data | CELLxGENE. Indicates whether this is the canonical data set (True) or the data are being reused from another source (False). | Use CELLxGENE Requirements. Permitted values are restricted to True or False. |
| obs.cell_enrichment | Human Cell Atlas: "Specifies the cell types targeted for enrichment or depletion beyond the selection of live cells." | CL term, followed by + or -. If there is no enrichment, use CL:0000000. For example, enrichment for fibroblasts would be CL:0000057+. |
| obs.intron_inclusion | Human Cell Atlas: "Were introns included during read counting in the alignment process?" | Permitted values are: yes, no. |
| obs.author_cell_type | Human Cell Atlas: "Encoding of author intuition of cellular annotation in the dataset." | Free text. |
| obsm.X_(suffix) | CELLxGENE: embeddings of at least two dimensions, e.g. tSNE, UMAP, PCA, spatial coordinates. | Use CELLxGENE terms for the suffix (e.g. umap, tsne, pca). |
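As a rough illustration of how the table above could be turned into a pre-submission check, the following sketch tests a list of obs column names and a few controlled-vocabulary columns against the requirements. The column names and permitted values are copied from the table; the helper names `missing_obs_columns` and `invalid_values` are hypothetical and are not part of any HTAN or CELLxGENE tool. It is not a substitute for the official validator.

```python
# Required obs columns, copied from the attribute table above.
REQUIRED_OBS_COLUMNS = [
    "organism_ontology_term_id",
    "donor_id",
    "sample_id",
    "development_stage_ontology_term_id",
    "sex_ontology_term_id",
    "self_reported_ethnicity_term_id",
    "disease_ontology_term_id",
    "tissue_type",
    "tissue_ontology_term_id",
    "cell_type_ontology_term_id",
    "assay_ontology_term_id",
    "suspension_type",
    "is_primary_data",
    "cell_enrichment",
    "intron_inclusion",
    "author_cell_type",
]

# Columns whose permitted values are listed explicitly in the table.
PERMITTED_VALUES = {
    "tissue_type": {"tissue", "organoid", "cell culture"},
    "suspension_type": {"cell", "nucleus", "na"},
    "intron_inclusion": {"yes", "no"},
}


def missing_obs_columns(columns):
    """Return the required obs columns that are absent (hypothetical helper)."""
    present = set(columns)
    return [name for name in REQUIRED_OBS_COLUMNS if name not in present]


def invalid_values(column, values):
    """Return the sorted distinct values of `column` that are not permitted."""
    allowed = PERMITTED_VALUES[column]
    return sorted({str(v) for v in values} - allowed)


# Example: an obs table that has only three of the required columns.
print(missing_obs_columns(["donor_id", "sample_id", "tissue_type"]))

# Example: "nuclei" is not a permitted suspension_type value.
print(invalid_values("suspension_type", ["cell", "nuclei", "na"]))
```

With a file loaded through `anndata.read_h5ad`, the same helpers could be called as `missing_obs_columns(adata.obs.columns)` and `invalid_values("suspension_type", adata.obs["suspension_type"])`.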

# HTAN h5ad File Validation

The HTAN Data Coordinating Center (DCC) has released a PyPI package called HTAN-h5ad-validator, with which Centers can validate their sc/snRNA-seq h5ad files. Sage Bionetworks will run the validator on sc/snRNA-seq h5ad files submitted to Synapse.

# Background: h5ad files, CELLxGENE, Human Cell Atlas

# h5ad (AnnData 0.10) brief overview

Please see AnnData’s documentation for a more detailed description of the AnnData object.

From https://raw.githubusercontent.com/scverse/anndata/main/docs/_static/img/anndata_schema.svg

For HTAN’s purposes, the following parts of the AnnData object are of interest:

  • .X - a matrix with counts where rows are cells and columns are genes.
  • var - a data frame with gene information (e.g. gene name, gene_is_filtered), one row per gene.
  • obs - a data frame with cell-level information, one row per cell.
  • obsm - one or more numpy ndarrays with cell embeddings, one row per cell.

CELLxGENE requires that raw data are submitted. Normalized data may also be submitted.

# CELLxGENE

The HTAN DCC submits sc/snRNA-seq data to CELLxGENE, a tool developed by the Chan Zuckerberg Initiative (CZI) to visualize and explore single cell and spatial data. The DCC submits data to CELLxGENE in h5ad (AnnData 0.10) format. CELLxGENE's schema requires:

  • use of Ensembl gene IDs.
  • a specific genome reference and annotation version.
  • specific h5ad (AnnData 0.10) attributes.
  • use of specific ontologies for many of the required attributes (e.g. the Cell Ontology).

The HTAN requirements for h5ad files are modeled after CELLxGENE's Dataset Requirements.
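The Ensembl gene ID requirement lends itself to a simple format check. The sketch below assumes that gene IDs are unversioned human Ensembl gene IDs of the form `ENSG` followed by 11 digits, as in the example ENSG00000107566; that pattern and the helper name `non_ensembl_ids` are assumptions made for illustration. The check only tests the format and does not confirm that an ID exists in the required annotation version.

```python
import re

# Assumed pattern: "ENSG" followed by exactly 11 digits, no version suffix.
ENSEMBL_GENE_ID = re.compile(r"ENSG\d{11}")


def non_ensembl_ids(gene_ids):
    """Return the entries that do not match the assumed Ensembl ID pattern."""
    return [g for g in gene_ids if not ENSEMBL_GENE_ID.fullmatch(str(g))]


# A gene symbol and a versioned ID (illustrative suffix) are both flagged.
print(non_ensembl_ids(["ENSG00000107566", "TP53", "ENSG00000107566.14"]))
# prints ['TP53', 'ENSG00000107566.14']
```

For an AnnData object, the helper could be applied to `adata.var.index`.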

# Human Cell Atlas (HCA)

The Human Cell Atlas (HCA) is a large repository of single cell data from healthy subjects. It provides standards for single-cell data submission which adopt most of the CELLxGENE schema, but also include additional fields. Aligning HTAN data with CELLxGENE will potentially facilitate data integration with other consortia such as the HCA. The HTAN requirements include three HCA attributes in addition to CELLxGENE required attributes.