CellO: Comprehensive and hierarchical cell type classification of human cells with the Cell Ontology

Expression data:

Quantified gene expression for all bulk RNA-seq samples used in this study is available as an HDF5 file (in log transcripts per million): bulk_log_tpm.h5

Each bulk RNA-seq experiment accession is mapped to a set of cell type labels from the Cell Ontology: bulk_labels.json

The single-cell data used in this study are also available as an HDF5 file (in log transcripts per million): single_cell_log_tpm.h5

Each single-cell experiment accession is mapped to a set of cell type labels from the Cell Ontology: single_cell_labels.json
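The label files described above map each experiment accession to a set of Cell Ontology labels. The sketch below shows one way to invert such a mapping so that each cell type points to its experiments. It assumes the JSON is a flat object of the form {accession: [term IDs]}, which is not confirmed here; the accessions, term IDs, and the helper name invert_labels are illustrative placeholders, and the example parses an inline string rather than the real file.

```python
import json
from collections import defaultdict

# Hypothetical excerpt standing in for bulk_labels.json or
# single_cell_labels.json. The layout (accession -> list of term IDs)
# is an assumption, and the accessions and IDs are placeholders.
example_json = """
{
  "EXP_0001": ["CL:0000084", "CL:0000542"],
  "EXP_0002": ["CL:0000236", "CL:0000542"],
  "EXP_0003": ["CL:0000084"]
}
"""


def invert_labels(accession_to_labels):
    """Return a dict mapping each cell type label to a sorted list of accessions."""
    label_to_accessions = defaultdict(list)
    for accession, labels in accession_to_labels.items():
        for label in labels:
            label_to_accessions[label].append(accession)
    # Sort for deterministic output regardless of input order.
    return {label: sorted(accs) for label, accs in label_to_accessions.items()}


labels = json.loads(example_json)
by_cell_type = invert_labels(labels)

for cell_type in sorted(by_cell_type):
    print(cell_type, by_cell_type[cell_type])
```

To apply this to the real data, replace the inline string with the contents of the downloaded file (for example, json.load on an open file handle), after checking that the file's actual layout matches the assumption above.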

Dataset partitions:

We partitioned the bulk RNA-seq data into several subsets that were used for various purposes in the study:

Technical variable annotations:

We annotated 27,097 RNA-seq samples in the Sequence Read Archive (SRA) with technical variables to derive a set of primary, healthy, untreated samples (i.e., the datasets above).

Trained model coefficients:

After training the binary classifiers for each cell type, the model coefficients can be used to investigate up- and downregulated genes in each cell type. Below, we post the model coefficients for the one-versus-rest binary classifiers (used in the Isotonic Regression and True Path Rule algorithms) as well as the coefficients for the classifiers in the Cascaded Logistic Regression algorithm. Each model was trained on the full set of bulk RNA-seq samples used in the study. Each algorithm's cell type model coefficients are available in a tab-separated values (TSV) file:
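As a rough illustration of how such coefficients might be inspected, the sketch below ranks genes by their coefficient for one cell type. In a linear classifier, a large positive coefficient suggests that higher expression of the gene pushes the prediction toward the cell type, and a large negative coefficient suggests the opposite. The table layout (one row per gene, one column per cell type, a header column named "gene"), the gene and cell type names, and the helper rank_genes are all assumptions made for illustration; the actual files may be organized differently.

```python
import csv
import io

# Hypothetical TSV excerpt; the real coefficient files may use a
# different layout. Rows are genes, columns are cell types (assumed).
example_tsv = (
    "gene\tcell_type_A\tcell_type_B\n"
    "GENE1\t0.8\t-0.1\n"
    "GENE2\t-0.5\t0.3\n"
    "GENE3\t0.1\t0.0\n"
    "GENE4\t-0.9\t0.7\n"
)


def rank_genes(tsv_text, cell_type, k=2):
    """Return (up, down): up to k genes with the most positive and the
    most negative coefficients for the given cell type column."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    coefs = [(row["gene"], float(row[cell_type])) for row in reader]
    coefs.sort(key=lambda gene_coef: gene_coef[1])  # ascending by coefficient
    # Most negative first; keep only strictly negative coefficients.
    down = [gene for gene, coef in coefs[:k] if coef < 0]
    # Most positive first; keep only strictly positive coefficients.
    up = [gene for gene, coef in reversed(coefs[-k:]) if coef > 0]
    return up, down


up, down = rank_genes(example_tsv, "cell_type_A")
print("candidate upregulated genes:", up)
print("candidate downregulated genes:", down)
```

Coefficient magnitude depends on how the features were scaled and regularized during training, so a ranking like this is best treated as a starting point for investigation rather than as a test of differential expression.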

Running the classifiers with CellO:

We packaged the classifiers into a software package called CellO (Cell Ontology-based classification), available on GitHub: https://github.com/deweylab/CellO

License:

All code and data used in this work are licensed under CC BY 4.0.