Datasets

Reusable benchmarking datasets and their file format specifications

To ensure interoperability between components, OpenProblems uses AnnData as the standard data format for both input and output files of components, and strict requirements are imposed on the format of these files.

Figure 1: AnnData objects have a structured format that includes the main data matrix (X, e.g. gene expression values), annotations of observations (obs, e.g. cell metadata), annotations of variables (var, e.g. gene metadata), and unstructured annotations (uns). This organization makes it easy to work with complex datasets while maintaining data integrity and ensuring a standardized structure across different components.

File format specifications

All OpenProblems tasks contain specifications for the exact format of all H5AD inputs and outputs of every component in the workflow. These specifications list the required and optional fields in the AnnData objects, along with a description of each field. They are used to validate the input and output files of components, and to generate the API documentation for each component.

These specifications are stored in the src/api/file_*.yaml files of each task. Here’s an example of such a specification: src/datasets/api/file_raw.yaml.

For more information on how these specifications are formatted, see “Design the API”.
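As a rough illustration of how a specification of required and optional fields might be used for validation, here is a minimal sketch. The `SPEC` dictionary is a hypothetical, simplified stand-in for the contents of a file_*.yaml file (the real layout may differ), and `missing_required_fields` is an illustrative helper, not part of OpenProblems. A `SimpleNamespace` stands in for an AnnData object so that the example runs without extra dependencies.

```python
from types import SimpleNamespace

# Hypothetical, simplified specification: slot name -> list of field entries.
# This is an assumption for illustration; the real YAML layout may differ.
SPEC = {
    "layers": [
        {"name": "counts", "type": "integer", "required": True},
        {"name": "normalized", "type": "double", "required": True},
    ],
    "obs": [
        {"name": "cell_type", "type": "string", "required": False},
    ],
    "uns": [
        {"name": "dataset_id", "type": "string", "required": True},
    ],
}


def missing_required_fields(adata, spec):
    """Return 'slot["field"]' strings for required fields absent from adata."""
    missing = []
    for slot, fields in spec.items():
        # Each slot (obs, layers, uns, ...) is expected to support `name in slot`.
        container = getattr(adata, slot, None)
        for field in fields:
            if not field["required"]:
                continue  # optional fields may legitimately be absent
            if container is None or field["name"] not in container:
                missing.append(f'{slot}["{field["name"]}"]')
    return missing


# Stand-in object with the same attribute names as an AnnData object.
example = SimpleNamespace(
    layers={"counts": [[1, 0], [0, 2]]},
    obs={},
    uns={"dataset_id": "pancreas"},
)
problems = missing_required_fields(example, SPEC)
print(problems)
```

The same membership checks should also work on a real AnnData object, since `obs` (a data frame), `layers` and `uns` all support the `in` operator on field names.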

Common datasets

OpenProblems offers a collection of common datasets that can be used to test components and run the benchmarking tasks. These datasets are generated by dataset loaders and processed by a common processing pipeline stored in src/datasets.

graph LR
  normalization:::group
  dataset_processors:::group
  raw_dataset["Raw dataset"]:::anndata
  common_dataset[Common<br/>dataset]:::anndata
  test_dataset[Test<br/>dataset]:::anndata
  dataset_loader[/Dataset<br/>loader/]:::component
  subgraph normalization [Normalization methods]
    log_cp10k[/"Log CP10k"/]:::component
    l1_sqrt[/"L1 sqrt"/]:::component
    log_scran_pooling[/"Log scran<br/>pooling"/]:::component
    sqrt_cp10k[/Sqrt CP10k/]:::component
  end
  subgraph dataset_processors[Dataset processors]
    hvg[/HVG/]:::component
    pca[/PCA/]:::component
    knn[/KNN/]:::component
  end
  dataset_loader --> raw_dataset --> log_cp10k & l1_sqrt & log_scran_pooling & sqrt_cp10k --> hvg --> pca --> knn --> common_dataset
  subset[/Subset/]:::component
  common_dataset --> subset --> test_dataset

Figure 2: Overview of the dataset processing workflow. Legend: grey rectangles are AnnData .h5ad files; purple rhomboids are Viash components.
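To give a feel for what one of the normalization steps in this workflow does, here is a minimal pure-Python sketch of a "Log CP10k"-style transformation. It assumes that the name means "scale each cell to 10,000 total counts, then apply log(1 + x)"; the function `log_cp10k` and its `target_sum` parameter are illustrative names, not the actual OpenProblems component.

```python
import math


def log_cp10k(counts, target_sum=10_000):
    """Scale each cell (row) to `target_sum` total counts, then apply log1p."""
    normalized = []
    for row in counts:
        total = sum(row)
        # Cells without any counts stay all-zero instead of dividing by zero.
        scale = target_sum / total if total > 0 else 0.0
        normalized.append([math.log1p(value * scale) for value in row])
    return normalized


# Rows are cells, columns are genes.
raw_counts = [[1, 0, 3], [0, 2, 2], [0, 0, 0]]
normalized = log_cp10k(raw_counts)
for row in normalized:
    print([round(value, 3) for value in row])
```

In the workflow above, the output of such a step would be the input to the HVG, PCA and KNN processors.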

File format of common datasets

The format of common datasets is based on the CELLxGENE schema, with additional OpenProblems-specific metadata (in the .uns slot) and additional output generated by our dataset preprocessors (in the .layers, .obsm, .obsp and .varm slots).

Here is what a typical common dataset looks like when printed to the console:

File format: Common dataset

A dataset processed by the common dataset processing pipeline.

Example file: resources_test/common/pancreas/dataset.h5ad

Description:

This dataset contains both raw counts and normalized data matrices, as well as a PCA embedding, HVG selection and a kNN graph.

Format:

AnnData object
 obs: 'dataset_id', 'assay', 'assay_ontology_term_id', 'cell_type', 'cell_type_ontology_term_id', 'development_stage', 'development_stage_ontology_term_id', 'disease', 'disease_ontology_term_id', 'donor_id', 'is_primary_data', 'organism', 'organism_ontology_term_id', 'self_reported_ethnicity', 'self_reported_ethnicity_ontology_term_id', 'sex', 'sex_ontology_term_id', 'suspension_type', 'tissue', 'tissue_ontology_term_id', 'tissue_general', 'tissue_general_ontology_term_id', 'batch', 'soma_joinid', 'size_factors'
 var: 'feature_id', 'feature_name', 'soma_joinid', 'hvg', 'hvg_score'
 obsm: 'X_pca'
 obsp: 'knn_distances', 'knn_connectivities'
 varm: 'pca_loadings'
 layers: 'counts', 'normalized'
 uns: 'dataset_id', 'dataset_name', 'dataset_url', 'dataset_reference', 'dataset_summary', 'dataset_description', 'dataset_organism', 'normalization_id', 'pca_variance', 'knn'

Slot description:

Slot	Type	Description
`obs["dataset_id"]`	`string`	(Optional) Identifier for the dataset from which the cell data is derived, useful for tracking and referencing purposes.
`obs["assay"]`	`string`	(Optional) Type of assay used to generate the cell data, indicating the methodology or technique employed.
`obs["assay_ontology_term_id"]`	`string`	(Optional) Experimental Factor Ontology (`EFO:`) term identifier for the assay, providing a standardized reference to the assay type.
`obs["cell_type"]`	`string`	(Optional) Classification of the cell type based on its characteristics and function within the tissue or organism.
`obs["cell_type_ontology_term_id"]`	`string`	(Optional) Cell Ontology (`CL:`) term identifier for the cell type, offering a standardized reference to the specific cell classification.
`obs["development_stage"]`	`string`	(Optional) Stage of development of the organism or tissue from which the cell is derived, indicating its maturity or developmental phase.
`obs["development_stage_ontology_term_id"]`	`string`	(Optional) Ontology term identifier for the developmental stage, providing a standardized reference to the organism’s developmental phase. If the organism is human (`organism_ontology_term_id == 'NCBITaxon:9606'`), then the Human Developmental Stages (`HsapDv:`) ontology is used. If the organism is mouse (`organism_ontology_term_id == 'NCBITaxon:10090'`), then the Mouse Developmental Stages (`MmusDv:`) ontology is used. Otherwise, the Uberon (`UBERON:`) ontology is used.
`obs["disease"]`	`string`	(Optional) Information on any disease or pathological condition associated with the cell or donor.
`obs["disease_ontology_term_id"]`	`string`	(Optional) Ontology term identifier for the disease, enabling standardized disease classification and referencing. Must be a term from the Mondo Disease Ontology (`MONDO:`), or `PATO:0000461` from the Phenotype And Trait Ontology (`PATO:`).
`obs["donor_id"]`	`string`	(Optional) Identifier for the donor from whom the cell sample is obtained.
`obs["is_primary_data"]`	`boolean`	(Optional) Indicates whether the data is primary (directly obtained from experiments) or has been computationally derived from other primary data.
`obs["organism"]`	`string`	(Optional) Organism from which the cell sample is obtained.
`obs["organism_ontology_term_id"]`	`string`	(Optional) Ontology term identifier for the organism, providing a standardized reference for the organism. Must be a term from the NCBI Taxonomy Ontology (`NCBITaxon:`) which is a child of `NCBITaxon:33208`.
`obs["self_reported_ethnicity"]`	`string`	(Optional) Ethnicity of the donor as self-reported, relevant for studies considering genetic diversity and population-specific traits.
`obs["self_reported_ethnicity_ontology_term_id"]`	`string`	(Optional) Ontology term identifier for the self-reported ethnicity, providing a standardized reference for ethnic classifications. If the organism is human (`organism_ontology_term_id == 'NCBITaxon:9606'`), then the Human Ancestry Ontology (`HANCESTRO:`) is used.
`obs["sex"]`	`string`	(Optional) Biological sex of the donor or source organism, crucial for studies involving sex-specific traits or conditions.
`obs["sex_ontology_term_id"]`	`string`	(Optional) Ontology term identifier for the biological sex, ensuring standardized classification of sex. Only `PATO:0000383`, `PATO:0000384` and `PATO:0001340` are allowed.
`obs["suspension_type"]`	`string`	(Optional) Type of suspension or medium in which the cells were stored or processed, important for understanding cell handling and conditions.
`obs["tissue"]`	`string`	(Optional) Specific tissue from which the cells were derived, key for context and specificity in cell studies.
`obs["tissue_ontology_term_id"]`	`string`	(Optional) Ontology term identifier for the tissue, providing a standardized reference for the tissue type. For organoid or tissue samples, the Uber-anatomy ontology (`UBERON:`) is used, and the term ID must be a child term of `UBERON:0001062` (anatomical entity). For cell cultures, the Cell Ontology (`CL:`) is used, and the term ID cannot be `CL:0000255`, `CL:0000257` or `CL:0000548`.
`obs["tissue_general"]`	`string`	(Optional) General category or classification of the tissue, useful for broader grouping and comparison of cell data.
`obs["tissue_general_ontology_term_id"]`	`string`	(Optional) Ontology term identifier for the general tissue category, aiding in standardizing and grouping tissue types. For organoid or tissue samples, the Uber-anatomy ontology (`UBERON:`) is used, and the term ID must be a child term of `UBERON:0001062` (anatomical entity). For cell cultures, the Cell Ontology (`CL:`) is used, and the term ID cannot be `CL:0000255`, `CL:0000257` or `CL:0000548`.
`obs["batch"]`	`string`	(Optional) A batch identifier. This label is very context-dependent and may be a combination of the tissue, assay, donor, etc.
`obs["soma_joinid"]`	`integer`	(Optional) If the dataset was retrieved from CELLxGENE census, this is a unique identifier for the cell.
`obs["size_factors"]`	`double`	(Optional) The size factors created by the normalization method, if any.
`var["feature_id"]`	`string`	(Optional) Unique identifier for the feature, usually an Ensembl gene ID.
`var["feature_name"]`	`string`	A human-readable name for the feature, usually a gene symbol.
`var["soma_joinid"]`	`integer`	(Optional) If the dataset was retrieved from CELLxGENE census, this is a unique identifier for the feature.
`var["hvg"]`	`boolean`	Whether or not the feature is considered to be a ‘highly variable gene’.
`var["hvg_score"]`	`double`	A score for the feature indicating how highly variable it is.
`obsm["X_pca"]`	`double`	The resulting PCA embedding.
`obsp["knn_distances"]`	`double`	K nearest neighbors distance matrix.
`obsp["knn_connectivities"]`	`double`	K nearest neighbors connectivities matrix.
`varm["pca_loadings"]`	`double`	The PCA loadings matrix.
`layers["counts"]`	`integer`	Raw counts.
`layers["normalized"]`	`double`	Normalized expression values.
`uns["dataset_id"]`	`string`	A unique identifier for the dataset. This is different from the `obs["dataset_id"]` field, which is the identifier for the dataset from which the cell data is derived.
`uns["dataset_name"]`	`string`	A human-readable name for the dataset.
`uns["dataset_url"]`	`string`	(Optional) Link to the original source of the dataset.
`uns["dataset_reference"]`	`string`	(Optional) Bibtex reference of the paper in which the dataset was published.
`uns["dataset_summary"]`	`string`	Short description of the dataset.
`uns["dataset_description"]`	`string`	Long description of the dataset.
`uns["dataset_organism"]`	`string`	(Optional) The organism of the sample in the dataset.
`uns["normalization_id"]`	`string`	Which normalization was used.
`uns["pca_variance"]`	`double`	The PCA variance objects.
`uns["knn"]`	`object`	Supplementary K nearest neighbors data.

Some slots might not be available depending on the origin of the dataset. Please visit the reference documentation on the common dataset file format used by OpenProblems for more information on each of the different slots.
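Some of the ontology rules in the table above lend themselves to simple programmatic checks. The following is a minimal sketch covering only two of the rules as stated in the table (allowed sex terms, and the organism-dependent prefix of the development stage term); the upstream schema may permit further values. The helper names `development_stage_prefix` and `check_cell` are hypothetical, and a plain dictionary stands in for one row of `obs`.

```python
# Allowed values for obs["sex_ontology_term_id"], as listed in the table.
ALLOWED_SEX_TERMS = {"PATO:0000383", "PATO:0000384", "PATO:0001340"}


def development_stage_prefix(organism_term_id):
    """Return the ontology prefix expected for the development stage term."""
    if organism_term_id == "NCBITaxon:9606":  # human
        return "HsapDv:"
    if organism_term_id == "NCBITaxon:10090":  # mouse
        return "MmusDv:"
    return "UBERON:"


def check_cell(cell):
    """Check one cell's metadata (a dict of obs fields) and list any problems.

    All fields are optional, so missing fields are skipped rather than reported.
    """
    problems = []
    sex = cell.get("sex_ontology_term_id")
    if sex is not None and sex not in ALLOWED_SEX_TERMS:
        problems.append(f"unexpected sex term: {sex}")
    stage = cell.get("development_stage_ontology_term_id")
    organism = cell.get("organism_ontology_term_id")
    if stage is not None and organism is not None:
        prefix = development_stage_prefix(organism)
        if not stage.startswith(prefix):
            problems.append(f"development stage {stage} should start with {prefix}")
    return problems


# A human cell annotated with a mouse development stage term.
cell = {
    "organism_ontology_term_id": "NCBITaxon:9606",
    "sex_ontology_term_id": "PATO:0000384",
    "development_stage_ontology_term_id": "MmusDv:0000110",
}
problems = check_cell(cell)
print(problems)
```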

Note

In OpenProblems, the X slot in the AnnData objects is typically not defined (None in Python, NULL in R). Instead, the raw counts and normalized expression data are stored as layers.
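In practice this means that code should read expression data from a named layer rather than from X. A minimal sketch of such an accessor is shown below; `get_layer` is a hypothetical helper, and a `SimpleNamespace` stands in for an AnnData object so that the example runs without extra dependencies.

```python
from types import SimpleNamespace


def get_layer(adata, name):
    """Return the named layer, with an informative error if it is missing."""
    if name not in adata.layers:
        available = ", ".join(sorted(adata.layers))
        raise KeyError(f"layer {name!r} not found; available layers: {available}")
    return adata.layers[name]


# Stand-in for a common dataset: X is unset, data lives in the layers.
dataset = SimpleNamespace(
    X=None,
    layers={
        "counts": [[3, 0], [1, 2]],
        "normalized": [[1.4, 0.0], [0.7, 1.1]],
    },
)
counts = get_layer(dataset, "counts")
print(counts)
```

With the anndata package installed, an object loaded via `anndata.read_h5ad("resources_test/common/pancreas/dataset.h5ad")` should expose the same `X` and `layers` attributes, so the helper could be used unchanged.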

Available datasets

Our datasets are stored in s3://openproblems-data/resources/datasets. Please visit the datasets page for more information on each of the available datasets.