Data processing

`AnnData` object ind I/O

The AnnData struct is imported from Muon.jl. The package provides read and write functions for .h5ad and .h5mu files, the typical H5-based format for storing Python anndata objects. The AnnData object stores datasets together with metadata, such as information on the variables (genes in scRNA-seq data) and observations (cells), as well as different kinds of annotations and transformations of the original count matrix, such as PCA or UMAP embeddings, or graphs of observations or variables.

For details on the Julia implementation in Muon.jl, see the documentation.

For more details on the original Python implementation of the anndata object, see the documentation and preprint.

Library size and normalization

scVI.init_library_size — Function

init_library_size(adata::AnnData; batch_key::Symbol=:batch)

Computes and returns library size based on AnnData object.

Based on the scvi-tools function from here

Returns a tupe of arrays of length equal to the number of batches in adata as stored in adata.obs[!,:batch_key], containing the means and variances of the library size in each batch in adata. Default batch key: :batch, if it is not found, defaults to 1 batch.

scVI.estimate_size_factors — Function

estimate_size_factors(mat; locfunc=median)

Estimates size factors to use for normalization, based on the corresponding Seurat functionality. Assumes a countmatrix mat in cell x gene format as input, returns a vector of size factors.

For details, please see the Seurat documentation.

scVI.normalize_size_factors — Function

normalize_size_factors(mat::Abstractmatrix)

Normalizes the countdata in mat by dividing it by the size factors calculated with estimate_size_factors. Assumes a countmatrix mat in cell x gene format as input, returns the normalized matrix.

scVI.normalize_size_factors! — Function

normalize_size_factors(adata::AnnData)

Normalizes the adata.X by dividing it by the size factors calculated with estimate_size_factors. Adds the normalized count matrix to adata.layers and returns adata.

scVI.normalize_total! — Function

normalize_total!(adata::AnnData; 
    target_sum::Union{Nothing, Real}=nothing, 
    key_added::String="cell_counts",
    layer::Union{Nothing, String}=nothing,
    verbose::Bool=false)

Normalizes counts per cell, specifically normalize each cell by total counts over all genes, so that every cell has the same total count after normalization. If choosing target_sum=1e6, this is CPM normalization.

Basic version of the scanpy.pp.normalize_total function

Arguments

adata: AnnData object
target_sum: if nothing, after normalization, each observation (cell) has a total count equal to the median of total counts for observations (cells) before normalization.
key_added: name of the field in adata.obs where the normalization factor is stored, set to "cell_counts" by default
layer: optional; which layer to normalize on. If nothing, adata.X is used.

Returns

Returns adata with normalized version of the original adata.X in adata.layers and the size factors in adata.obs.

scVI.normalize_total — Function

normalize_total(adata::AnnData; 
    target_sum::Union{Nothing, Real}=1e4, 
    key_added::String="cell_counts",
    verbose::Bool=false)

Normalizes counts per cell, specifically normalize each cell by total counts over all genes, so that every cell has the same total count after normalization. See normalize_total! for details. Unlike the in-place version, this function returns a dictionary with the normalized counts and scaled counts per cell.

Filtering

scVI.filter_cells — Function

function filter_cells(adata::AnnData; 
    min_counts::Union{Int, Nothing}=nothing, 
    min_genes::Union{Int, Nothing}=nothing, 
    max_counts::Union{Int, Nothing}=nothing, 
    max_genes::Union{Int, Nothing}=nothing,
    verbose::Bool = true)

Filter cell outliers based on counts and numbers of genes expressed. For instance, only keep cells with at least min_counts counts or min_genes genes expressed. This is to filter measurement outliers, i.e. “unreliable” observations. Only provide one of the optional parameters min_counts, min_genes, max_counts, max_genes per call.

Parameters

adata: AnnData object of shape n_obs × n_vars. Rows correspond to cells and columns to genes.
min_counts: Minimum number of counts required for a cell to pass filtering.
min_genes: Minimum number of genes expressed required for a cell to pass filtering.
max_counts: Maximum number of counts required for a cell to pass filtering.
max_genes: Maximum number of genes expressed required for a cell to pass filtering.

Returns

cells_subset: BitVector index mask that does filtering; true means that the cell is kept, false means the cell is removed.
number_per_cell Depending on what was thresholded (counts or genes), the array stores n_counts or n_cells per gene.

scVI.filter_cells! — Function

function filter_cells!(adata::AnnData; 
    min_counts::Union{Int, Nothing}=nothing, 
    min_genes::Union{Int, Nothing}=nothing, 
    max_counts::Union{Int, Nothing}=nothing, 
    max_genes::Union{Int, Nothing}=nothing,
    verbose::Bool = true)

Filter cell outliers based on counts and numbers of genes expressed. For instance, only keep cells with at least min_counts counts or min_genes genes expressed. This is to filter measurement outliers, i.e. “unreliable” observations. Only provide one of the optional parameters min_counts, min_genes, max_counts, max_genes per call.

Parameters

adata: AnnData object of shape n_obs × n_vars. Rows correspond to cells and columns to genes.
min_counts: Minimum number of counts required for a cell to pass filtering.
min_genes: Minimum number of genes expressed required for a cell to pass filtering.
max_counts: Maximum number of counts required for a cell to pass filtering.
max_genes: Maximum number of genes expressed required for a cell to pass filtering.

Returns

The filtered AnnData object

scVI.filter_genes! — Function

filter_genes!(adata::AnnData; 
    min_counts::Union{Int, Nothing}=nothing, 
    min_cells::Union{Int, Nothing}=nothing, 
    max_counts::Union{Int, Nothing}=nothing, 
    max_cells::Union{Int, Nothing}=nothing,
    verbose::Bool = true)

Filter genes based on number of cells or counts. Keep genes that have at least min_counts counts or are expressed in at least min_cells cells or have at most max_counts counts or are expressed in at most max_cells cells. Only provide one of the optional parameters min_counts, min_cells, max_counts, max_cells per call.

Parameters

adata: AnnData object of shape n_obs × n_vars. Rows correspond to cells and columns to genes.
min_counts: Minimum number of counts required for a gene to pass filtering.
min_cells: Minimum number of cells expressed required for a gene to pass filtering.
max_counts: Maximum number of counts required for a gene to pass filtering.
max_cells: Maximum number of cells expressed required for a gene to pass filtering.

Returns

The filtered AnnData object

scVI.filter_genes — Function

filter_genes(adata::AnnData; 
    min_counts::Union{Int, Nothing}=nothing, 
    min_cells::Union{Int, Nothing}=nothing, 
    max_counts::Union{Int, Nothing}=nothing, 
    max_cells::Union{Int, Nothing}=nothing,
    verbose::Bool = true)

Filter genes based on number of cells or counts. Keep genes that have at least min_counts counts or are expressed in at least min_cells cells or have at most max_counts counts or are expressed in at most max_cells cells. Only provide one of the optional parameters min_counts, min_cells, max_counts, max_cells per call. This is the out-of-place version; for details on the input arguments, see the in-place version filter_genes!

Returns

A BitVector with true for every gene that should be included and a vector specifying for each gene the number of cells it is expressed in.

Simple transformations

scVI.log_transform! — Function

log_transform!(adata::AnnData; 
    layer::String="normalized",
    verbose::Bool=false)

Log-transforms the data. Looks for a layer of normalized counts in adata.layers["normalized"]. If the layer is not there, it uses adata.X. Returns the adata object with the log-transformed values in a new layer "log_transformed".

scVI.logp1_transform! — Function

logp1_transform!(adata::AnnData; 
    layer::Union{String, Nothing}=nothing,
    verbose::Bool=false)

Log-transforms the (count) data, adding a pseudocount of 1. Uses the X in adata.X by default, but other layers can be passed using the layer keyword. Returns the adata object with the log-transformed values in a new layer "logp1_transformed".

scVI.sqrt_transform! — Function

sqrt_transform!(adata::AnnData; 
    layer::String="normalized",
    verbose::Bool=false)

Sqrt-transforms the data. Looks for a layer of normalized counts in adata.layers["normalized"]. If the layer is not there, it uses adata.X. Returns the adata object with the sqrt-transformed values in a new layer "sqrt_transformed".

scVI.rescale! — Function

rescale!(adata::AnnData; 
    layer::Union{String, Nothing}=nothing,
    verbose::Bool=false)

Rescales the data to zero mean and unit variance in each gene, using the specified layer. If none is provided, it uses adata.X. Returns the adata object with the rescales values in a new layer "rescaled".

Dimension reduction

scVI.pca! — Function

function pca!(adata::AnnData; 
    layer::String="log_transformed", 
    n_pcs::Int=size(adata.X,2),
    verbose::Bool=false
)

Performs a PCA on the specified layer of an AnnData object and stores the results in adata.obsm. Uses all variables of the layer by default, but the number of PCs to be stored can also be specified with the n_pcs keyword. Returns the modified AnnData object.

scVI.umap! — Function

function umap!(adata::AnnData; 
    layer::String="log_transformed", 
    use_pca_init::Bool=false, 
    n_pcs::Int=100, 
    verbose::Bool=true, 
    kwargs...)

Performs UMAP on the specified layer of an AnnData object. If the layer is not found, the log-transformed normalized counts are calculated and used. Optionally, UMAP can be run on a PCA representation, the number of PCs can be specified (default=100). For customizing the behaviour or UMAP, see the keyword arguments of the UMAP.UMAP_ function. They can all be passed via the kwargs.

The fields of the resulting UMAP_ struct are stored as follows: - the UMAP embedding in adata.obsm["umap"], - the fuzzy simplicial knn graph in adata.obsp["fuzzyneighborgraph"], - the KNNs of each cell in adata.obsm["knns"], - the distances of each cell to its KNNs in adata.obsm["knn_dists"]

Returns the modified AnnData object.

Highly variable genes

scVI.highly_variable_genes — Method

highly_variable_genes(adata::AnnData;
    layer::Union{String,Nothing} = nothing,
    n_top_genes::Int=2000,
    batch_key::Union{String,Nothing} = nothing,
    span::Float64=0.3
    )

Computes highly variable genes according to the workflows on scanpy and Seurat v3 per batch and returns a dictionary with the information on the joint HVGs. For the in-place version, see highly_variable_genes!

More specifically, it is the Julia re-implementation of the corresponding scanpy function For implementation details, please check the scanpy/Seurat documentations or the source code of the lower-level _highly_variable_genes_seurat_v3 function in this package. Results are almost identical to the scanpy function. The differences have been traced back to differences in the local regression for the mean-variance relationship implemented in the Loess.jl package, that differs slightly from the corresponding Python implementation.

Arguments

adata: AnnData object
layer: optional; which layer to use for calculating the HVGs. Function assumes this is a layer of counts. If layer is not provided, adata.X is used.
n_top_genes: optional; desired number of highly variable genes. Default: 2000.
batch_key: optional; key where to look for the batch indices in adata.obs. If not provided, data is treated as one batch.
span: span to use in the loess fit for the mean-variance local regression. See the Loess.jl docs for details.
replace_hvgs: whether or not to replace the hvg information if there are already hvgs calculated. If false, the new values are added with a "_1" suffix. Default:true,
verbose: whether or not to print info on current status

Returns

Returns a dictionary containing information on the highly variable genes, specifically containing the following keys is added:

highly_variable: vector of Bools indicating which genes are highly variable
highly_variable_rank: rank of the highly variable genes according to (corrected) variance
means: vector with means of each gene
variances: vector with variances of each gene
variances_norm: normalized variances of each gene
highly_variable_nbatches: if there are batches in the dataset, logs the number of batches in which each highly variable gene was actually detected as highly variable.

scVI.highly_variable_genes! — Method

highly_variable_genes!(adata::AnnData;
    layer::Union{String,Nothing} = nothing,
    n_top_genes::Int=2000,
    batch_key::Union{String,Nothing} = nothing,
    span::Float64=0.3,
    replace_hvgs::Bool=true,
    verbose::Bool=false
    )

Computes highly variable genes per batch according to the workflows on scanpy and Seurat v3 in-place. This is the in-place version that adds an dictionary containing information on the highly variable genes directly to the adata.var and returns the modified AnnData object. For details, see the not-in-place version ?highly_variable_genes.

scVI.subset_to_hvg! — Method

subset_to_hvg!(adata::AnnData;
    layer::Union{String,Nothing} = nothing,
    n_top_genes::Int=2000,
    batch_key::Union{String,Nothing} = nothing,
    span::Float64=0.3,
    verbose::Bool=true
)

Calculates highly variable genes with highly_variable_genes! and subsets the AnnData object to the calculated HVGs. For description of input arguments, see highly_variable_genes!

Returns: adata object subset to the calculated HVGs, both in the countmatrix/layer data used for HVG calculation and in the adata.var dictionary.

Loading built-in datasets

There are currently three datasets for which the package supplies built-in convenience functions for loading, processing and creating corresponding AnnData objects. They can be downloaded from this Google Drive data folder. The folder contains all three datasets, namely

the cortex data, corresponding to the cortex dataset from the scvi-tools from Zeisel et al. 2015. The original data can be found here and has been processed analogous to the scvi-tools processing
the tasic data from Tasic et al. (2016), available at Gene expression Omnibus (GEO) under accession number GSE71585. Preprocessing and additional annotation according to the original manuscript; annotations are available and loaded together with the countmatrix.
the pbmc data (PBMC8k) from Zheng et al. 2017, preprocessed according to the Bioconductor workflow.

I recommend downloading the complete GoogleDrive folder and having it as a subfolder named data in the current working directory. Then, in any Julia script in the parent directory, the functions load_cortex(), load_pbmc() and load_tasic() can be called without arguments, because the default path where these functions look for the respective dataset is exactly that subfolder named data.

Cortex data

Missing docstring.

Missing docstring for load_cortex_from_h5ad. Check Documenter's build log for details.

scVI.load_cortex — Function

load_cortex(path::String=""; verbose::Bool=false)

Loads cortex dataset from Zeisel et al. 2015 and creates a corresponding AnnData object.

Looks for a file cortex_anndata.h5ad that can be downloaded from this GoogleDrive data folder. The functions first looks in the folder passed as path (default: assumes files are in a subfolder named data of the current directory, i.e., that the complete GoogleDrive data folder has been downloaded in the current directory), and alternatively downloads the data if is cannot find the file in the given path (see below).

The file is the h5 export of the Python AnnData object provided as built-in cortex dataset from scvi-tools, data is from Zeisel et al. 2015.

If the file is present, the data is loaded from the Python AnnData object and stored in an analogous Julia AnnData object. This is handled by the functions init_cortex_from_h5ad and load_cortex_from_h5ad.

Alternatively, if the h5ad file is not found in the folder, the data is downloaded directly from the original authors and processed analogous to the scvi-tools processing, and subsequently stored to a Julia AnnData object. This is handled by the function init_cortex_from_url.

Returns the Julia AnnData object.

Example

julia> load_cortex()
    AnnData object with a countmatrix with 3005 cells and 1200 genes
    layers dict with the following keys: ["counts"]
    summary statistics dict with the following keys: ["n_labels", "n_vars", "n_batch", "n_continuous_covs", "n_cells", "n_proteins"]
    unique celltypes: ["interneurons", "pyramidal SS", "pyramidal CA1", "oligodendrocytes", "microglia", "endothelial-mural", "astrocytes_ependymal"]
    training status: not trained

PBMC data

scVI.load_pbmc — Function

load_pbmc(path::String = "data/")

Loads pbmc dataset from Zheng et al. 2017 and creates a corresponding AnnData object. Specifically, the PBMC8k version is used, preprocessed according to the Bioconductor workflow.

Loads the following files that can be downloaded from this GoogleDrive data folder:

PBMC_counts.csv: countmatrix
PBMC_annotation.csv: cell type annotation

Files are loaded from the folder passed as path (default: assumes files are in a subfolder named data of the current directory, i.e., that the complete GoogleDrive data folder has been downloaded in the current directory.)

From these input files, a Julia AnnData object is created. The countmatrix contains information on cell barcodes and gene names. The gene name and celltype information is stored in the vars and obs dictionaries of the AnnData object, respectively.

Returns the Julia AnnData object.

Example

julia> load_pbmc()
    AnnData object with a countmatrix with 7480 cells and 200 genes
    unique celltypes: ["B-cells", "CD4+ T-cells", "Monocytes", "CD8+ T-cells", "NK cells", "NA", "HSC", "Erythrocytes"]
    training status: not trained

Tasic data

scVI.load_tasic — Function

load_tasic(path::String = "data/")

Loads tasic dataset based on Tasic et al. (2016) and creates a corresponding AnnData object.

Loads the following files that can be downloaded from this GoogleDrive data folder:

Tasic_countmat.txt: countmatrix
Tasic_celltypes.txt: cell types
Tasic_genenames.txt: gene names
Tasic_receptorandmarkers.txt: List of receptor and marker genes

Files are loaded from the folder passed as path (default: assumes files are in a subfolder named data of the current directory, i.e., that the complete GoogleDrive data folder has been downloaded in the current directory.

The original data is available at Gene expression Omnibus (GEO) under accession number GSE71585. Preprocessing and annotation has been prepared according to the original manuscript.

From these input files, a Julia AnnData object is created. The list of receptor and marker genes is used to annotate cells as neural vs. non-neural, and annotate the neural cells as GABA- or Glutamatergic.

These annotations together with the cell type information and the gene names and receptor/marker list are stored in Dictionaries in the obs and vars fields of the AnnData object.

Additionally, size factors are calculated and used for normalizing the counts. The normalized counts are stored in an additional layer named normalized_counts.

Returns the Julia AnnData object.

Example

julia> load_tasic()
    AnnData object with a countmatrix with 1679 cells and 15119 genes
    layers dict with the following keys: ["normalized_counts", "counts"]
    unique celltypes: ["Vip", "L4", "L2/3", "L2", "Pvalb", "Ndnf", "L5a", "SMC", "Astro", "L5", "Micro", "Endo", "Sst", "L6b", "Sncg", "Igtp", "Oligo", "Smad3", "OPC", "L5b", "L6a"]
    training status: not trained

scVI.subset_tasic! — Function

subset_tasic!(adata::AnnData)

Subsets an input AnnData object initialized from the Tasic data according load_tasic to the neural cells and the receptor and marker genes provided as annotation.

Specifically, the count matrix and the normalized count matrix are subset to these cells and genes, and the dictionaries with information about cells and genes in adata.obs and adata.vars are also subset accordingly.

Returns the modified AnnData object.