Dataset Curation Guidelines for Loading and File Placement

This guide outlines the process for curating datasets and specifies the location where curation files should be placed for proper loading. By following these guidelines, you ensure datasets are prepared correctly for integration and that the files are stored in the appropriate directories for seamless access and management.

Curation Process: Steps for Preparing Datasets

This section describes the necessary steps to format, validate, and prepare datasets for loading into the system, ensuring they meet the required standards.

Dataset Curation Guidelines

This section will instruct curators on how to correctly fill in each of the columns required in a curated CSV. Examples have been provided based on the preparation of the linked dataset [here] (https://cellxgene.cziscience.com/collections/9b02383a-9358-4f0f-9795-a891ec523bcc) into a curated CSV ready for upload.

Columns and Guidelines

Dataset (individual datasets within larger group):
Description: The specific name of the dataset being curated within a larger dataset group.
Example: "Single cell transcriptional and chromatin accessibility profiling redefine cellular heterogeneity in the adult human kidney - ATACseq"
Full name dataset (top of page):
Description: The full descriptive name of the dataset that should be used for documentation and display.
Example: "Single cell transcriptional and chromatin accessibility profiling redefine cellular heterogeneity in the adult human kidney"
CxG Link:
Description: The CellxGene link to access the dataset.
Example: "https://cellxgene.cziscience.com/e/13a027de-ea3e-432b-9a5e-6bc7048498fc.cxg/"
h5ad link:
Description: The direct link to the .h5ad data file of the dataset.
Example: "https://datasets.cellxgene.cziscience.com/dabd979f-cc50-4526-81f3-8bc6c673ca36.h5ad"
Reference_DOI:
Description: The DOI reference for the associated publication(s) for the dataset.
Example: "DOI: 10.1038/s41467-021-22368-w"
Bionetworks reference:
Description: Indicate whether the dataset has a reference within the Bionetworks repository.
Example: "T" (True)
Standard category present? (T/F):
Description: Flag indicating whether standard categories are present in the dataset.
Example: "T" (True)
Standard category cell_type present? (T/F):
Description: Flag indicating whether the standard category for cell type is present in the dataset.
Example: "T" (True)
Author Category Cell Type Field Name:
Description: This column shows the name of the field as it appears in the Dataset Explorer UI. It indicates which specific field within the dataset corresponds to a certain category, such as "cell type" or other annotations. Fields marked as Cell types in the Content column play a key role in graph generation using the pandasaurus_cxg library, which is employed in the data pipeline.
Example: "author_cell_type"
Content:
- Description: This column indicates whether the field is used for cell type annotations or for other dataset annotations (e.g., Cell type or Other).
- Example: "Cell types"
Value type(s):
- Description: This column specifies if the values in the dataset are represented in full names or as abbreviations.
- Example: "abbreviations"
Notes:
- Description: Any additional notes or comments regarding the dataset.
- Example: "Only standard categories used"
Study Short Name:
- Description: The shortened name or acronym of the study associated with the dataset.
- Example: "Muto et al. (2021) Nat Commun"
CxG Dataset Collection X:
- Description: The CellxGene link to the collection where the dataset is stored.
- Example: "https://cellxgene.cziscience.com/collections/9b02383a-9358-4f0f-9795-a891ec523bcc"
Is the dataset Normal or Normal/Diseased:
- Description: Indicates whether the dataset includes normal samples, diseased samples or both.
- Example: "Normal"
Stage:
- Description: The biological stage of the samples in the dataset, such as adult, fetal, etc.
- Example: "Adult"

General Tips for Curators:

Ensure that fields marked as Cell types in the Content column are correctly paired with appropriate Author Category Cell Type Field Name, as these pairs are crucial for graph generation in the data pipeline using the pandasaurus_cxg library.
Ensure all links (CxG and h5ad) are correct and accessible.
Use consistent naming for datasets across related entries.
Double-check flags (T/F) to ensure they correctly reflect the presence of specific categories.
Fill out fields such as Study Short Name and Notes with proper references to aid in documentation and user clarity.

By following these guidelines, curators can ensure that datasets are correctly formatted and ready for integration into the pipeline.

File Placement: Where to Store Curation Files

This section provides guidance on the correct directory structure and file locations for placing curated datasets to ensure they are properly recognized and accessible during the loading process.

In the pipeline, curated CSV files are stored in the curated_data folder. When the pipeline is run, these CSV files are automatically converted into a YAML file named cxg_author_cell_type.yml and placed in the config folder. The YAML file maps the CxG links to the corresponding author_cell_type_list fields, which are essential for processing.

Example of the YAML format:

- CxG_link: https://datasets.cellxgene.cziscience.com/03af5481-a0b6-426c-86b4-9127ada17b53.h5ad
  author_cell_type_list:
  - author_cell_type
  - author_cluster_label
- CxG_link: https://datasets.cellxgene.cziscience.com/080f9be4-0f94-48cb-a82f-db53df1542ff.h5ad
  author_cell_type_list:
  - author_cluster_name
  - author_cell_type
  - author_cell_type

The CxG links in the YAML file are then used to download datasets into the dataset folder. Finally, the pandasaurus_cxg library is used to generate RDF graphs, which are stored in the graph folder for further use in the pipeline.