Run the script by executing the following command:
python s1_qc.py -p <prefix> [-o <output_directory>] [-m <mitochondrial_genes_percent>] [-s]
- <prefix>: Prefix of the input file (required).
- <output_directory>: Output directory for the results (default: "qc").
- <mitochondrial_genes_percent>: Mitochondrial gene percentage above which a cell is treated as an outlier (default: 15).
- `-s`: Flag indicating whether to scale the data (default: False).

`adata.raw` is created in `s1*.py` after filtering outliers and normalization.
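The options above can be sketched with `argparse`. This is only an illustration of the documented flags and defaults, not the actual parser in `s1_qc.py`; the `dest` names and help strings are assumptions.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented s1_qc.py options.

    The dest names are hypothetical; only the flags and defaults
    come from the usage text.
    """
    parser = argparse.ArgumentParser(prog="s1_qc.py")
    parser.add_argument("-p", dest="prefix", required=True,
                        help="prefix of the input file")
    parser.add_argument("-o", dest="output_directory", default="qc",
                        help="output directory for the results")
    parser.add_argument("-m", dest="mitochondrial_genes_percent",
                        type=float, default=15,
                        help="mitochondrial percentage threshold")
    parser.add_argument("-s", dest="scale", action="store_true",
                        help="scale the data")
    return parser


# Parse an example command line instead of sys.argv.
args = build_parser().parse_args(["-p", "sample1", "-m", "20"])
print(args.prefix, args.output_directory,
      args.mitochondrial_genes_percent, args.scale)
```

Omitting `-s` leaves scaling off, which matches the documented default of False.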
The script produces the following outputs:
- QC Result:
  - File: `d1_<prefix>_qced.h5ad`
  - Description: Filtered and quality-controlled dataset after outlier removal.
- Doublet detection
  - powered by package `DoubletDetection`
- Normalized Result:
  - File: `d2_<prefix>_normlog.h5ad`
  - Description: Normalized and log-transformed dataset after QC.
  - Notice: contains `adata.raw` (includes all genes after normalization). Doublet prediction result and doublet score columns are added to `adata.obs`.
- Figures:
  - Location: `<output_directory>`
  - Description:
    - QC related: total counts histogram, mitochondrial gene percentage violin plot, scatter plot of total counts vs. gene counts, and violin plot of QC metrics.
    - Doublet detection: Doublet heatmap
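The outlier removal behind the QC result can be illustrated with a small plain-Python sketch of a mitochondrial percentage filter. It is not the script's implementation, which presumably works on an AnnData object. The `MT-` gene prefix, the inclusive threshold, and both function names are assumptions.

```python
def mito_percent(gene_counts, mito_prefix="MT-"):
    """Return the percentage of a cell's counts from mitochondrial genes.

    gene_counts maps gene symbol -> count for one cell. The "MT-" prefix
    (human gene naming) is an assumption.
    """
    total = sum(gene_counts.values())
    if total == 0:
        return 0.0
    mito = sum(count for gene, count in gene_counts.items()
               if gene.upper().startswith(mito_prefix))
    return 100.0 * mito / total


def filter_mito_outliers(cells, max_percent=15):
    """Keep cells whose mitochondrial percentage is at most max_percent."""
    return {cell_id: counts for cell_id, counts in cells.items()
            if mito_percent(counts) <= max_percent}


cells = {
    "cell_a": {"MT-CO1": 5, "ACTB": 95},    # 5% mitochondrial
    "cell_b": {"MT-ND1": 40, "GAPDH": 60},  # 40% mitochondrial
}
kept = filter_mito_outliers(cells, max_percent=15)
print(sorted(kept))  # ['cell_a']
```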
python s3_bbknn.py -n <number> -p <prefix> [-b <batch_key>]
- `<number>`: The number of highly variable genes to be selected (default: 2000).
- `<prefix>`: Prefix of the input file (required).
- `<batch_key>`: Batch key of the file (default: 'patient').

The script calls the following functions:
- `sc.pp.pca`
- `sc.pp.neighbors`
- `sc.tl.umap`
- `sc.tl.leiden`
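
A minimal sketch of how these four Scanpy calls are usually chained. The order and the use of default arguments are assumptions, since the script's actual arguments are not documented here. `embed_and_cluster` is a hypothetical helper, and its `sc` parameter exists only so the module can be swapped out for testing.

```python
def embed_and_cluster(adata, sc=None):
    """Run PCA, neighbor graph, UMAP and Leiden clustering on an AnnData.

    Hypothetical helper; all Scanpy calls use their default arguments.
    """
    if sc is None:
        import scanpy as sc  # requires scanpy to be installed

    sc.pp.pca(adata)        # principal components, stored in adata.obsm["X_pca"]
    sc.pp.neighbors(adata)  # k-nearest-neighbor graph built on the PCA space
    sc.tl.umap(adata)       # 2-D embedding, stored in adata.obsm["X_umap"]
    sc.tl.leiden(adata)     # cluster labels, stored in adata.obs["leiden"]
    return adata
```

With Scanpy installed, `embed_and_cluster(adata)` modifies `adata` in place and returns it.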
The script produces the following outputs:
- Highly Variable Genes (HVGs) Result:
  - File: `d3_<prefix>.h5ad`
  - Description: Dataset containing the selected highly variable genes.
- PCA Result:
  - Description: Principal Component Analysis (PCA) performed on the selected HVGs.
- UMAP Result:
  - Description: Uniform Manifold Approximation and Projection (UMAP) performed on the PCA results.
  - Figure: UMAP plot showing the clustering results colored by 'leiden' and 'batch_key'.
    - File: `before_bbknn.png`
    - Location: `cluster/`
- UMAP Result (Saved Dataset):
  - File: `d4_<suffix>_umap.h5ad`
  - Description: Dataset with UMAP coordinates and clustering information.
python s3_bbknn.py -p <prefix> -b <batch_key>
- `<prefix>`: Prefix of the input file (required).
- `<batch_key>`: Batch key of the file (required).
The script produces the following outputs:
- BBKNN Result:
  - File: `d5_<prefix>.h5ad`
  - Description: Dataset after performing the BBKNN integration.
- UMAP Result:
  - Figure: UMAP plot showing the clustering results after BBKNN integration, colored by 'leiden_r2' and 'batch_key'.
    - File: `after_bbknn.png`
    - Location: `cluster/`

Note: Leiden clustering is run with `resolution=2`, and the result after BBKNN is stored in key `leiden_r2`.

`adata.layers["counts"]` is also created in `s1.py`. Data in the `counts` layer is un-normalized and not log-transformed.
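
A rough sketch of the integration step, assuming BBKNN is called through Scanpy's external API and followed by UMAP and Leiden clustering. Only `resolution=2` and the `leiden_r2` key come from the notes here; the helper name, the call order, and the injectable `sc` parameter are assumptions.

```python
def integrate_with_bbknn(adata, batch_key, sc=None, resolution=2):
    """Rebuild the neighbor graph with BBKNN, then re-embed and re-cluster.

    Hypothetical helper; the real script may pass further arguments.
    """
    if sc is None:
        import scanpy as sc  # requires scanpy and the bbknn package

    # Batch-balanced k-nearest-neighbor graph; replaces sc.pp.neighbors.
    sc.external.pp.bbknn(adata, batch_key=batch_key)
    sc.tl.umap(adata)
    # Store clusters under adata.obs["leiden_r2"] instead of the default key.
    sc.tl.leiden(adata, resolution=resolution, key_added="leiden_r2")
    return adata
```

Using a separate key such as `leiden_r2` keeps these clusters apart from any stored earlier under the default `leiden` key.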

marker file: should contain at least two columns, `cell_name` and `Symbol`.
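
A small sketch for checking a marker file before use. It assumes the marker file is comma-separated with a header row, which is not stated here; `missing_marker_columns` is a hypothetical helper.

```python
import csv
import io

REQUIRED_COLUMNS = {"cell_name", "Symbol"}


def missing_marker_columns(handle):
    """Return the required columns that are absent from the header row."""
    reader = csv.DictReader(handle)
    header = set(reader.fieldnames or [])
    return sorted(REQUIRED_COLUMNS - header)


# In-memory example; a real file would be opened with open(path, newline="").
example = io.StringIO("cell_name,Symbol,source\nT cell,CD3E,manual\n")
print(missing_marker_columns(example))  # []
```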

Produces boxplots and heatmaps, and conducts a statistical test.

`--heatmap`:
- `all`: draw one heatmap with all conditions (levels)
- `sep`: draw a separate heatmap for each condition (level)
- `no`: do not draw heatmap
- `heatmap.csv`: provide a custom file for grouping

Format of `heatmap.csv` (`|` represents `,` in csv file):

| Tumor | I II III |
|---|---|
| [condition] | [states (separated by spaces)] |
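
To make the format concrete, here is a sketch that reads such a file into a mapping from condition to its list of states. It assumes every row is a data row (no header) with the condition in the first field and space-separated states in the second; `read_heatmap_groups` is a hypothetical name.

```python
import csv
import io


def read_heatmap_groups(handle):
    """Map each condition to the list of states given in its row."""
    groups = {}
    for row in csv.reader(handle):
        # Skip blank or incomplete rows.
        if len(row) < 2 or not row[0].strip():
            continue
        groups[row[0].strip()] = row[1].split()
    return groups


example = io.StringIO("Tumor,I II III\n")
print(read_heatmap_groups(example))  # {'Tumor': ['I', 'II', 'III']}
```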

`--filter_sample`: If a number is specified, samples whose total number of cells (of the same cell type) is below this threshold are filtered out. We used `15` as the threshold.
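
The filtering rule can be sketched in plain Python as follows. This only illustrates the thresholding logic under the assumption that each cell carries a sample label and a cell type label; `samples_to_keep` and the input layout are hypothetical.

```python
from collections import Counter


def samples_to_keep(cells, cell_type, min_cells=15):
    """Return samples with at least min_cells cells of the given cell type.

    cells is an iterable of (sample, cell_type) pairs, one pair per cell.
    Samples below the threshold are dropped.
    """
    counts = Counter(sample for sample, ctype in cells if ctype == cell_type)
    return sorted(sample for sample, n in counts.items() if n >= min_cells)


cells = ([("P1", "T cell")] * 20
         + [("P2", "T cell")] * 3
         + [("P2", "B cell")] * 30)
print(samples_to_keep(cells, "T cell"))  # ['P1']
```

Because the count is taken per cell type, a sample can be dropped for one cell type and kept for another, as `P2` is here.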

`--test_type`: Choices are 1 (one-sided test) or 2 (two-sided test). The default is 2.