Data processing
AnnData
object and I/O
The AnnData
struct is imported from Muon.jl
. The package provides read and write functions for .h5ad
and .h5mu
files, the typical H5-based format for storing Python anndata
objects. The AnnData
object stores datasets together with metadata, such as information on the variables (genes in scRNA-seq data) and observations (cells), as well as different kinds of annotations and transformations of the original count matrix, such as PCA or UMAP embeddings, or graphs of observations or variables.
For details on the Julia implementation in Muon.jl
, see the documentation.
For more details on the original Python implementation of the anndata
object, see the documentation and preprint.
scVI.get_celltypes
— Functionget_celltypes(a::AnnData)
Tries to infer the cell types of cells in an AnnData
object.
Returns a vector of cell type names if the cell types are stored in adata.obs["cell_type"]
, adata.obs["celltype"]
, adata.obs["celltypes"]
, or adata.obs["cell_types"]
. Otherwise, returns nothing
.
scVI.subset_adata
— Functionsubset_adata(adata::AnnData, subset_inds::Tuple, dims::Symbol=:both)
Subset an AnnData
object by indices passed as a tuple of vectors of integers, UnitRanges, or vectors of Booleans. If dims
is set to :both, the first element of subset_inds
is used to subset cells and the second element is used to subset genes.
Arguments
adata
:AnnData
object to subsetsubset_inds
: tuple of vectors of integers, UnitRanges, or vectors of Booleansdims
: dimension to subset, either:cells
,:genes
, or:both
Returns
- a copy of the
AnnData
object with the subsetted data
scVI.subset_adata
— Methodsubset_adata(adata::AnnData, subset_inds::Union{Int, Vector{Int}, UnitRange, Vector{Bool}}, dims::Symbol)
Subset an AnnData
object by indices passed as a vector of integers or booleans or as a UnitRange. The dims
argument can be set to either :cells
or :genes
to specify which dimension to subset.
Arguments
adata
:AnnData
object to subsetsubset_inds
: vector of integers or booleans or UnitRangedims
: dimension to subset, either:cells
or:genes
Returns
- a copy of the
AnnData
object with the subsetted data
scVI.subset_adata!
— Methodsubset_adata!(adata::AnnData, subset_inds, dims::Symbol)
In-place version of subset_adata
, see ?subset_adata
for more details.
For subset_inds
, either a tuple of vectors or ranges can be passed with dims
set to :both, for subsetting both cells and genes, or a single vector or range can be passed with dims
set to either :cells or :genes.
Arguments
adata
:AnnData
object to subsetsubset_inds
: tuple of vectors of integers, UnitRanges, or vectors of Booleans or vector of integers or booleans or UnitRangedims
: dimension to subset, either:cells
,:genes
, or:both
Returns
- the
AnnData
object with the subsetted data
Library size and normalization
scVI.init_library_size
— Functioninit_library_size(adata::AnnData; batch_key::Symbol=:batch)
Computes and returns library size based on AnnData
object.
Based on the scvi-tools
function from here
Returns a tupe of arrays of length equal to the number of batches in adata
as stored in adata.obs[!,:batch_key]
, containing the means and variances of the library size in each batch in adata
. Default batch key: :batch
, if it is not found, defaults to 1 batch.
Arguments
adata
:AnnData
object
Keyword arguments
batch_key
: the key inadata.obs
that stores the batch information (default::batch
)
Returns
- a tuple of arrays of length equal to the number of batches in
adata
as stored inadata.obs[!,:batch_key]
, corresponding to the log library means and vars per batch
scVI.estimate_size_factors
— Functionestimate_size_factors(mat; locfunc=median)
Estimates size factors to use for normalization, based on the corresponding Seurat functionality. Assumes a countmatrix mat
in cell x gene format as input, returns a vector of size factors.
For details, please see the Seurat documentation.
Arguments
mat
: countmatrix in cell x gene format
Keyword arguments
locfunc
: the function to use for calculating the location parameter (default:median
)
Returns
- a vector of size factors
scVI.normalize_size_factors
— Functionnormalize_size_factors(mat::Abstractmatrix)
Normalizes the countdata in mat
by dividing it by the size factors calculated with estimate_size_factors
. Assumes a countmatrix mat
in cell x gene format as input, returns the normalized matrix.
Arguments
mat
: countmatrix in cell x gene format
Returns
- the normalized countmatrix
scVI.normalize_size_factors!
— Functionnormalize_size_factors(adata::AnnData)
Normalizes the adata.X
by dividing it by the size factors calculated with estimate_size_factors
. Adds the normalized count matrix to adata.layers
and returns adata
.
Arguments
adata
:AnnData
object
Returns
- the modified
AnnData
object
scVI.normalize_total!
— Functionnormalize_total!(adata::AnnData;
target_sum::Union{Nothing, Real}=nothing,
key_added::String="cell_counts",
layer::Union{Nothing, String}=nothing,
verbose::Bool=false)
Normalizes counts per cell, specifically normalize each cell by total counts over all genes, so that every cell has the same total count after normalization. If choosing target_sum=1e6
, this is CPM normalization.
Basic version of the scanpy.pp.normalize_total function
Arguments
adata
:AnnData
object
Keyword arguments
target_sum
: ifnothing
, after normalization, each observation (cell) has a total count equal to the median of total counts for observations (cells) before normalization.key_added
: name of the field inadata.obs
where the normalization factor is stored, set to "cell_counts" by defaultlayer
: optional; which layer to normalize on. Ifnothing
,adata.X
is used.
Returns
- the
adata
object with normalized version of the originaladata.X
inadata.layers
and the size factors inadata.obs
.
scVI.normalize_total
— Functionnormalize_total(adata::AnnData;
target_sum::Union{Nothing, Real}=1e4,
key_added::String="cell_counts",
verbose::Bool=false)
Normalizes counts per cell, specifically normalize each cell by total counts over all genes, so that every cell has the same total count after normalization. See normalize_total!
for details. Unlike the in-place version, this function returns a dictionary with the normalized counts and scaled counts per cell.
Arguments
adata
:AnnData
object
Keyword arguments
target_sum
: ifnothing
, after normalization, each observation (cell) has a total count equal to the median of total counts for observations (cells) before normalization.key_added
: name of the field inadata.obs
where the normalization factor is stored, set to "cell_counts" by defaultverbose
: whether to print progress messages (default: false)
Returns
- a dictionary with the normalized counts and scaled counts per cell
Filtering
scVI.filter_cells
— Functionfunction filter_cells(countmatrix::AbstractMatrix;
min_counts::Union{Int, Nothing}=nothing,
min_genes::Union{Int, Nothing}=nothing,
max_counts::Union{Int, Nothing}=nothing,
max_genes::Union{Int, Nothing}=nothing,
verbose::Bool=true,
make_plot::Bool=false,
log_transform_plot::Bool=false,
save_plot::Bool=false,
plot_filename::String="gene_filtering_histogram.pdf"
)
Filter cell outliers based on counts and numbers of genes expressed. For instance, only keep cells with at least min_counts
counts or min_genes
genes expressed. This is to filter measurement outliers, i.e. “unreliable” observations. Only provide one of the optional parameters min_counts
, min_genes
, max_counts
, max_genes
per call.
Arguments
countmatrix
: countmatrix of shapen_obs
×n_vars
. Rows correspond to cells and columns to genes.min_counts
: Minimum number of counts required for a cell to pass filtering.min_genes
: Minimum number of genes expressed required for a cell to pass filtering.max_counts
: Maximum number of counts required for a cell to pass filtering.max_genes
: Maximum number of genes expressed required for a cell to pass filtering.
Keyword arguments
verbose
: whether to print info and status messagesmake_plot
: whether to plot the histogram of cells before filteringlog_transform_plot
: whether to log-transform the counts of cells for plottingsave_plot
: whether to save the histogram of cells before filteringplot_filename
: filename to save the histogram of cells before filtering
Returns
cells_subset
: BitVector index mask that does filtering;true
means that the cell is kept,false
means the cell is removed.number_per_cell
: Depending on what was thresholded (counts
orgenes
), the array storesn_counts
orn_genes
per cell.
scVI.filter_cells!
— Functionfunction filter_cells!(adata::AnnData;
min_counts::Union{Int, Nothing}=nothing,
min_genes::Union{Int, Nothing}=nothing,
max_counts::Union{Int, Nothing}=nothing,
max_genes::Union{Int, Nothing}=nothing,
verbose::Bool = true,
plot_before::Bool=false,
log_transform_plot::Bool=false,
plot_after::Bool=false,
save_plot_before::Bool=false,
save_plot_after::Bool=false,
plot_filename_before::String="cell_histogram_before_filtering.pdf",
plot_filename_after::String="cell_histogram_after_filtering.pdf"
)
Filter cell outliers based on counts and numbers of genes expressed. For instance, only keep cells with at least min_counts
counts or min_genes
genes expressed. This is to filter measurement outliers, i.e. “unreliable” observations. Only provide one of the optional parameters min_counts
, min_genes
, max_counts
, max_genes
per call.
Arguments
adata
:AnnData
object of shapen_obs
×n_vars
. Rows correspond to cells and columns to genes.min_counts
: Minimum number of counts required for a cell to pass filtering.min_genes
: Minimum number of genes expressed required for a cell to pass filtering.max_counts
: Maximum number of counts required for a cell to pass filtering.max_genes
: Maximum number of genes expressed required for a cell to pass filtering
Keyword arguments
verbose
: whether to print info and status messagesplot_before
: whether to plot the histogram of cells before filteringlog_transform_plot
: whether to log-transform the counts of cells for plottingplot_after
: whether to plot the histogram of cells after filteringsave_plot_before
: whether to save the histogram of cells before filteringsave_plot_after
: whether to save the histogram of cells after filteringplot_filename_before
: filename to save the histogram of cells before filteringplot_filename_after
: filename to save the histogram of cells after filtering
Returns
- the filtered
AnnData
object
scVI.filter_genes!
— Functionfilter_genes!(adata::AnnData;
min_counts::Union{Int, Nothing}=nothing,
min_cells::Union{Int, Nothing}=nothing,
max_counts::Union{Int, Nothing}=nothing,
max_cells::Union{Int, Nothing}=nothing,
verbose::Bool = true,
plot_before::Bool=false,
log_transform_plot::Bool=false,
plot_after::Bool=false,
save_plot_before::Bool=false,
save_plot_after::Bool=false,
plot_filename_before::String="gene_histogram_before_filtering.pdf",
plot_filename_after::String="gene_histogram_after_filtering.pdf"
)
Filter genes based on number of cells or counts. Keep genes that have at least min_counts
counts or are expressed in at least min_cells
cells or have at most max_counts
counts or are expressed in at most max_cells
cells. Only provide one of the optional parameters min_counts
, min_cells
, max_counts
, max_cells
per call.
Arguments
adata
:AnnData
object of shapen_obs
×n_vars
. Rows correspond
to cells and columns to genes.
min_counts
: Minimum number of counts required for a gene to pass filtering.min_cells
: Minimum number of cells expressed required for a gene to pass filtering.max_counts
: Maximum number of counts required for a gene to pass filtering.max_cells
: Maximum number of cells expressed required for a gene to pass filtering.
Keyword arguments
verbose
: whether to print info and status messagesplot_before
: whether to plot the histogram of genes before filteringlog_transform_plot
: whether to log-transform the counts of genes for plottingplot_after
: whether to plot the histogram of genes after filteringsave_plot_before
: whether to save the histogram of genes before filteringsave_plot_after
: whether to save the histogram of genes after filteringplot_filename_before
: filename to save the histogram of genes before filteringplot_filename_after
: filename to save the histogram of genes after filtering
Returns
- the filtered
AnnData
object
scVI.filter_genes
— Functionfilter_genes(adata::AnnData;
min_counts::Union{Int, Nothing}=nothing,
min_cells::Union{Int, Nothing}=nothing,
max_counts::Union{Int, Nothing}=nothing,
max_cells::Union{Int, Nothing}=nothing,
verbose::Bool=true,
make_plot::Bool=false,
log_transform_plot::Bool=false,
save_plot::Bool=false,
plot_filename::String="gene_filtering_histogram.pdf"
)
Filter genes based on number of cells or counts. Keep genes that have at least min_counts
counts or are expressed in at least min_cells
cells or have at most max_counts
counts or are expressed in at most max_cells
cells. Only provide one of the optional parameters min_counts
, min_cells
, max_counts
, max_cells
per call. This is the out-of-place version; for details on the input arguments, see the in-place version filter_genes!
Arguments
adata
:AnnData
object of shapen_obs
×n_vars
. Rows correspond to cells and columns to genes.min_counts
: Minimum number of counts required for a gene to pass filtering.min_cells
: Minimum number of cells in which it is expressed required for a gene to pass filtering.max_counts
: Maximum number of counts required for a gene to pass filtering.max_cells
: Maximum number of cells in which it is expressed required for a gene to pass filtering.
Keyword arguments
verbose
: whether to print info and status messagesmake_plot
: whether to plot the histogram of cells before filteringlog_transform_plot
: whether to log-transform the counts of cells for plottingsave_plot
: whether to save the histogram of cells before filteringplot_filename
: filename to save the histogram of cells before filtering
Returns
gene_subset
: BitVector index mask that does filtering;true
means that the gene is kept,false
means the cell is removed.number_per_gene
: Depending on what was thresholded (counts
orcells
), the array storesn_counts
orn_cells
per gene.
Simple transformations
scVI.log_transform!
— Functionlog_transform!(adata::AnnData;
layer::String="normalized",
verbose::Bool=false)
Log-transforms the data. Looks for a layer of normalized counts in adata.layers["normalized"]
. If the layer is not there, it uses adata.X
.
Arguments
adata
: theAnnData
object to be modified
Keyword arguments
layer
: the layer to be used for PCA (default: "log_transformed")verbose
: whether to print progress messages (default: false)
Returns
- the adata object with the log-transformed values in a new layer
"log_transformed"
.
scVI.logp1_transform!
— Functionlogp1_transform!(adata::AnnData;
layer::Union{String, Nothing}=nothing,
verbose::Bool=false)
Log-transforms the (count) data, adding a pseudocount of 1. Uses the X in adata.X
by default, but other layers can be passed using the layer
keyword.
Arguments
adata
: theAnnData
object to be modified
Keyword arguments
layer
: the layer to be used for log + 1 transformationverbose
: whether to print progress messages (default: false)
Returns
- the adata object with the log-transformed values in a new layer
"logp1_transformed"
.
scVI.sqrt_transform!
— Functionsqrt_transform!(adata::AnnData;
layer::String="normalized",
verbose::Bool=false)
Sqrt-transforms the data. Looks for a layer of normalized counts in adata.layers["normalized"]
. If the layer is not there, it uses adata.X
.
Arguments
adata
: theAnnData
object to be modified
Keyword arguments
layer
: the layer to be used for the transformationverbose
: whether to print progress messages (default: false)
Returns
- the adata object with the sqrt-transformed values in a new layer
"sqrt_transformed"
.
scVI.rescale!
— Functionrescale!(adata::AnnData;
layer::Union{String, Nothing}=nothing,
verbose::Bool=false)
Rescales the data to zero mean and unit variance in each gene, using the specified layer. If none is provided, it uses adata.X
.
Arguments
adata
: theAnnData
object to be modified
Keyword arguments
layer
: the layer to be used for the transformationverbose
: whether to print progress messages (default: false)
Returns
- the adata object with the rescales values in a new layer
"rescaled"
.
Dimension reduction
scVI.pca!
— Functionfunction pca!(adata::AnnData;
layer::String="log_transformed",
n_pcs::Int=size(adata.X,2),
verbose::Bool=false
)
Performs a PCA on the specified layer of an AnnData
object and stores the results in adata.obsm
. Uses all variables of the layer by default, but the number of PCs to be stored can also be specified with the n_pcs
keyword. If the layer is not found, the log-transformed normalized counts are calculated and used.
Arguments
adata
: theAnnData
object to be modified
Keyword arguments
layer
: the layer to be used for PCA (default: "log_transformed")n_pcs
: the number of PCs to be stored (default: all variables)verbose
: whether to print progress messages (default: false)
Returns
- the modified
AnnData
object
scVI.umap!
— Functionfunction umap!(adata::AnnData;
layer::String="log_transformed",
use_pca_init::Bool=false,
n_pcs::Int=100,
verbose::Bool=true,
kwargs...)
Performs UMAP on the specified layer of an AnnData
object. If the layer is not found, the log-transformed normalized counts are calculated and used. Optionally, UMAP can be run on a PCA representation, the number of PCs can be specified (default=100). For customizing the behaviour or UMAP, see the keyword arguments of the UMAP.UMAP_
function. They can all be passed via the kwargs
.
The fields of the resulting UMAP_
struct are stored as follows: - the UMAP embedding in adata.obsm["umap"]
, - the fuzzy simplicial knn graph in adata.obsp["fuzzyneighborgraph"], - the KNNs of each cell in adata.obsm["knns"]
, - the distances of each cell to its KNNs in adata.obsm["knn_dists"]
Arguments
adata
: theAnnData
object to be modified
Keyword arguments
layer
: the layer to be used for UMAP (default: "log_transformed")use_pca_init
: whether to use a PCA representation for UMAP (default: false)n_pcs
: the number of PCs to be used for UMAP (default: 100)verbose
: whether to print progress messages (default: false)kwargs
: keyword arguments forUMAP.UMAP_
Returns
- the modified
AnnData
object
Highly variable genes
scVI.highly_variable_genes
— Methodhighly_variable_genes(adata::AnnData;
layer::Union{String,Nothing} = nothing,
n_top_genes::Int=2000,
batch_key::Union{String,Nothing} = nothing,
span::Float64=0.3
)
Computes highly variable genes according to the workflows on scanpy
and Seurat v3 per batch and returns a dictionary with the information on the joint HVGs. For the in-place version, see highly_variable_genes!
More specifically, it is the Julia re-implementation of the corresponding scanpy
function For implementation details, please check the scanpy
/Seurat documentations or the source code of the lower-level _highly_variable_genes_seurat_v3
function in this package. Results are almost identical to the scanpy
function. The differences have been traced back to differences in the local regression for the mean-variance relationship implemented in the Loess.jl package, that differs slightly from the corresponding Python implementation.
Arguments
adata
:AnnData
object
Keyword arguments
layer
: optional; which layer to use for calculating the HVGs. Function assumes this is a layer of counts. Iflayer
is not provided,adata.X
is used.n_top_genes
: optional; desired number of highly variable genes. Default: 2000.batch_key
: optional; key where to look for the batch indices inadata.obs
. If not provided, data is treated as one batch.span
: span to use in the loess fit for the mean-variance local regression. See the Loess.jl docs for details.replace_hvgs
: whether or not to replace the hvg information if there are already hvgs calculated. If false, the new values are added with a "_1" suffix. Default:true,verbose
: whether or not to print info on current status
Returns
- a dictionary containing information on the highly variable genes, specifically containing the following keys is added:
highly_variable
: vector ofBool
s indicating which genes are highly variablehighly_variable_rank
: rank of the highly variable genes according to (corrected) variancemeans
: vector with means of each genevariances
: vector with variances of each genevariances_norm
: normalized variances of each genehighly_variable_nbatches
: if there are batches in the dataset, logs the number of batches in which each highly variable gene was actually detected as highly variable.
scVI.highly_variable_genes!
— Methodhighly_variable_genes!(adata::AnnData;
layer::Union{String,Nothing} = nothing,
n_top_genes::Int=2000,
batch_key::Union{String,Nothing} = nothing,
span::Float64=0.3,
replace_hvgs::Bool=true,
verbose::Bool=false
)
Computes highly variable genes per batch according to the workflows on scanpy
and Seurat v3 in-place. This is the in-place version that adds an dictionary containing information on the highly variable genes directly to the adata.var
and returns the modified AnnData
object. For details, see the not-in-place version ?highly_variable_genes
.
scVI.subset_to_hvg!
— Methodsubset_to_hvg!(adata::AnnData;
layer::Union{String,Nothing} = nothing,
n_top_genes::Int=2000,
batch_key::Union{String,Nothing} = nothing,
span::Float64=0.3,
verbose::Bool=true
)
Calculates highly variable genes with highly_variable_genes!
and subsets the AnnData
object to the calculated HVGs. For description of input arguments, see highly_variable_genes!
Arguments
adata
:AnnData
object
Keyword arguments
layer
: optional; which layer to use for calculating the HVGs. Function assumes this is a layer of counts. Iflayer
is not provided,adata.X
is used.n_top_genes
: optional; desired number of highly variable genes. Default: 2000.batch_key
: optional; key where to look for the batch indices inadata.obs
. If not provided, data is treated as one batch.span
: span to use in the loess fit for the mean-variance local regression. See the Loess.jl docs for details.verbose
: whether or not to print info on current status
Returns
adata
object subset to the calculated HVGs, both in the countmatrix/layer data used for HVG calculation and in theadata.var
dictionary.
Plotting
scVI.plot_pca
— Functionfunction plot_pca(
adata::AnnData;
color_by::String="",
pcs::Vector{Int}=[1,2],
recompute::Bool=false,
save_plot::Bool=false,
filename::String="PCA.pdf"
)
Plot a PCA embedding on a given AnnData
object.
Arguments
adata
: AnnData object
Keyword arguments
color_by
: column name ofadata.obs
to color the plot by, or a gene name to color by expression. If neither is provided, celltypes will be used if present, otherwise all cells will be colored the same.pcs
: which PCs to plotrecompute
: whether to recompute the PCA embedding or use an already existing onesave_plot
: whether to save the plot to a filefilename
: filename to save the plot to
Returns
- the plot object
scVI.plot_umap
— Functionfunction plot_umap(
adata::AnnData;
color_by::String="",
recompute::Bool=false,
save_plot::Bool=false,
filename::String="UMAP.pdf"
)
Plot a UMAP embedding on a given AnnData
object.
Arguments
adata
: AnnData object
Keyword arguments
color_by
: column name ofadata.obs
to color the plot by, or a gene name to color by expression. If neither is provided, celltypes will be used if present, otherwise all cells will be colored the same.recompute
: whether to recompute the UMAP embeddingsave_plot
: whether to save the plot to a filefilename
: filename to save the plot to
Returns
- the plot object
scVI.plot_histogram
— Functionfunction plot_histogram(adata::AnnData,
cell_gene::Symbol = :gene,
counts_number::Symbol = :counts;
cutoff::Union{Nothing, Real}=nothing,
log_transform::Bool=false,
save_plot::Bool=false,
filename::String="counts_per_gene_histogram.pdf"
)
Plot a histogram of counts per gene or cells, or alternatively he number of genes expressed per cell or the number of cells per gene in which it is expressed.
Additionally, a user-specified cutoff can be plotted on top, to be used as a visualization tool for filtering.
The function is called internally when using the plotting options in the filter_cells
and filter_genes
functions.
Arguments
adata
: AnnData objectcell_gene
: one of:cell
orgene
; whether to plot counts per gene or per cellcounts_number
: one of:counts
or:number
; whether to plot counts or number of genes/cells
Keyword arguments
cutoff
: cutoff to plot on top of the histogramlog_transform
: whether to log transform the countssave_plot
: whether to save the plot to a filefilename
: filename to save the plot to
Returns
hist
: the plot object
scVI.plot_highest_expressed_genes
— Functionplot_highest_expressed_genes(adata::AnnData;
n_top::Int=30,
gene_symbols::Union{String, Nothing}=nothing,
save_plot::Bool=false,
filename::String="highest_expressed_genes.pdf"
)
The function computes for each gene the fraction of counts assigned to that gene within a cell. The n_top
genes with the highest mean fraction over all cells are plotted as boxplots.
Arguments
adata
: AnnData object
Keyword arguments
n_top
: number of genes to plotgene_symbols
: column name ofadata.var
to use as gene names. Ifnothing
,adata.var_names
will be used.save_plot
: whether to save the plot to a filefilename
: filename to save the plot to
Returns
- the plot object
scVI.plot_highly_variable_genes
— Functionfunction plot_highly_variable_genes(adata::AnnData;
log_transform::Bool=false,
save_plot::Bool=false,
filename::String="highly_variable_genes.pdf"
)
end
Plot dispersions or normalized variance versus means for genes
Arguments
adata
: AnnData object
Keyword arguments
log_transform
: whether to log transform the countssave_plot
: whether to save the plot to a filefilename
: filename to save the plot to
Returns
- the plot object
Loading built-in datasets
There are currently three datasets for which the package supplies built-in convenience functions for loading, processing and creating corresponding AnnData
objects. They can be downloaded from this Google Drive data
folder. The folder contains all three datasets, namely
- the
cortex
data, corresponding to thecortex
dataset from thescvi-tools
from Zeisel et al. 2015. The original data can be found here and has been processed analogous to thescvi-tools
processing - the
tasic
data from Tasic et al. (2016), available at Gene expression Omnibus (GEO) under accession number GSE71585. Preprocessing and additional annotation according to the original manuscript; annotations are available and loaded together with the countmatrix. - the
pbmc
data (PBMC8k) from Zheng et al. 2017, preprocessed according to the Bioconductor workflow.
I recommend downloading the complete GoogleDrive folder and having it as a subfolder named data
in the current working directory. Then, in any Julia script in the parent directory, the functions load_cortex()
, load_pbmc()
and load_tasic()
can be called without arguments, because the default path
where these functions look for the respective dataset is exactly that subfolder named data
.
Cortex data
scVI.load_cortex
— Functionload_cortex(path::String=""; verbose::Bool=false)
Loads cortex
dataset from Zeisel et al. 2015 and creates a corresponding AnnData
object.
Looks for a file cortex_anndata.h5ad
that can be downloaded from this GoogleDrive data
folder. The functions first looks in the folder passed as path
(default: assumes files are in a subfolder named data
of the current directory, i.e., that the complete GoogleDrive data
folder has been downloaded in the current directory), and alternatively downloads the data if is cannot find the file in the given path
(see below).
The file is the h5
export of the Python AnnData
object provided as built-in cortex
dataset from scvi-tools
, data is from Zeisel et al. 2015.
If the file is present, the data is loaded from the Python AnnData
object and stored in an analogous Julia AnnData
object. This is handled by the functions init_cortex_from_h5ad
and load_cortex_from_h5ad
.
Alternatively, if the h5ad
file is not found in the folder, the data is downloaded directly from the original authors and processed analogous to the scvi-tools
processing, and subsequently stored to a Julia AnnData
object. This is handled by the function init_cortex_from_url
.
Arguments
path::String
: path to the folder where the file is storedverbose::Bool
: whether to print progress information
Returns
- the Julia
AnnData
object.
Example
julia> load_cortex()
AnnData object with a countmatrix with 3005 cells and 1200 genes
layers dict with the following keys: ["counts"]
summary statistics dict with the following keys: ["n_labels", "n_vars", "n_batch", "n_continuous_covs", "n_cells", "n_proteins"]
unique celltypes: ["interneurons", "pyramidal SS", "pyramidal CA1", "oligodendrocytes", "microglia", "endothelial-mural", "astrocytes_ependymal"]
training status: not trained
PBMC data
scVI.load_pbmc
— Functionload_pbmc(path::String = "data/")
Loads pbmc
dataset from Zheng et al. 2017 and creates a corresponding AnnData
object. Specifically, the PBMC8k version is used, preprocessed according to the Bioconductor workflow.
Loads the following files that can be downloaded from this GoogleDrive data
folder:
PBMC_counts.csv
: countmatrixPBMC_annotation.csv
: cell type annotation
Files are loaded from the folder passed as path
(default: assumes files are in a subfolder named data
of the current directory, i.e., that the complete GoogleDrive data
folder has been downloaded in the current directory.)
From these input files, a Julia AnnData
object is created. The countmatrix contains information on cell barcodes and gene names. The gene name and celltype information is stored in the vars
and obs
dictionaries of the AnnData
object, respectively.
Arguments
path
: path to the folder containing the input files (default: "data/")
Returns
- the Julia
AnnData
object.
Example
julia> load_pbmc()
AnnData object with a countmatrix with 7480 cells and 200 genes
unique celltypes: ["B-cells", "CD4+ T-cells", "Monocytes", "CD8+ T-cells", "NK cells", "NA", "HSC", "Erythrocytes"]
training status: not trained
Tasic data
scVI.load_tasic
— Functionload_tasic(path::String = "data/")
Loads tasic
dataset based on Tasic et al. (2016) and creates a corresponding AnnData
object.
Loads the following files that can be downloaded from this GoogleDrive data
folder:
Tasic_countmat.txt
: countmatrixTasic_celltypes.txt
: cell typesTasic_genenames.txt
: gene namesTasic_receptorandmarkers.txt
: List of receptor and marker genes
Files are loaded from the folder passed as path
(default: assumes files are in a subfolder named data
of the current directory, i.e., that the complete GoogleDrive data
folder has been downloaded in the current directory.
The original data is available at Gene expression Omnibus (GEO) under accession number GSE71585. Preprocessing and annotation has been prepared according to the original manuscript.
From these input files, a Julia AnnData
object is created. The list of receptor and marker genes is used to annotate cells as neural vs. non-neural, and annotate the neural cells as GABA- or Glutamatergic.
These annotations together with the cell type information and the gene names and receptor/marker list are stored in Dictionaries in the obs
and vars
fields of the AnnData
object.
Additionally, size factors are calculated and used for normalizing the counts. The normalized counts are stored in an additional layer
named normalized_counts
.
Arguments
path
: path to the folder containing the input files (default: "data/")
Returns
- the Julia
AnnData
object.
Example
julia> load_tasic()
AnnData object with a countmatrix with 1679 cells and 15119 genes
layers dict with the following keys: ["normalized_counts", "counts"]
unique celltypes: ["Vip", "L4", "L2/3", "L2", "Pvalb", "Ndnf", "L5a", "SMC", "Astro", "L5", "Micro", "Endo", "Sst", "L6b", "Sncg", "Igtp", "Oligo", "Smad3", "OPC", "L5b", "L6a"]
training status: not trained
scVI.subset_tasic!
— Functionsubset_tasic!(adata::AnnData)
Subsets an input AnnData
object initialized from the Tasic data according load_tasic
to the neural cells and the receptor and marker genes provided as annotation.
Specifically, the count matrix and the normalized count matrix are subset to these cells and genes, and the dictionaries with information about cells and genes in adata.obs
and adata.vars
are also subset accordingly.
Arguments
adata
:AnnData
object initialized from the Tasic data
Returns
- the modified
AnnData
object.