Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UnboundLocalError #18

Open
kzb193 opened this issue Dec 29, 2024 · 4 comments
Open

UnboundLocalError #18

kzb193 opened this issue Dec 29, 2024 · 4 comments

Comments

@kzb193
Copy link

kzb193 commented Dec 29, 2024

Hello Dr. Ma,

I want to apply CalicoST on a Visium sample, and I have annotations for the normal spots. I encountered errors in each of the following two cases.

  1. tumor purity estimation module (normalidx_file : None)
  2. infer CNAs and clones with the annotations (tumorprop_file : None)

Interestingly, the 'infer CNAs and clones' step finished successfully without estimating tumor proportions and without the annotations.
Here are some details on the errors.

Error message from case 1

Traceback (most recent call last):
File "/net/mulan/home/kalinsba/apps/CalicoST/src/calicost/estimate_tumor_proportion.py", line 120, in
main(args.configfile)
File "/net/mulan/home/kalinsba/apps/CalicoST/src/calicost/estimate_tumor_proportion.py", line 101, in main
loh_states, is_B_lost, rdr_values, clones_hightumor = identify_loh_per_clone(single_X, combined_assignment, combined_pred_cnv, combined_p_binom, normal_candidate, single_total_bb_RD)
File "/net/mulan/home/kalinsba/apps/CalicoST/src/calicost/utils_hmrf.py", line 623, in identify_loh_per_clone
return loh_states, is_B_lost, rdr_values[loh_states], clones_hightumor
UnboundLocalError: local variable 'loh_states' referenced before assignment

Error message from case 2

/net/mulan/home/kalinsba/apps/CalicoST/src/calicost/hmm_NB_BB_phaseswitch.py:650: RuntimeWarning: invalid value encountered in divide
lambd = np.sum(base_nb_mean, axis=1) / np.sum(base_nb_mean)
Traceback (most recent call last):
File "/net/mulan/home/kalinsba/apps/CalicoST/src/calicost/calicost_main.py", line 444, in
main(args.configfile)
File "/net/mulan/home/kalinsba/apps/CalicoST/src/calicost/calicost_main.py", line 138, in main
index_normal = np.where(normal_candidate)[0]
UnboundLocalError: local variable 'normal_candidate' referenced before assignment

Can you please help?

@Congm12
Copy link
Contributor

Congm12 commented Dec 30, 2024

Hi,

Regarding your case 2, currently CalicoST does not fully support providing normal annotation file. The normalidx_file argument is a feature under development, and it should always be None under the current implementation.

Regarding case 1, the most probable explanations (1) the BAF threshold to identify an HMM state are too low and does allow any of the states to be loh_states, or (2) the threshold of the number of genomic bins with low BAF values are too high. These parameter values are based on previous datasets, which likely have large differences with your visium sample. We will add this point to our to do list and come with better suggestions of these parameters in the next version.

Best,
Cong

@kzb193
Copy link
Author

kzb193 commented Dec 30, 2024

Thank you very much Dr. Ma for the details. I have a follow-up query. How can I apply CalicoST for non-Visium data such as Slide-seqV2? I am not sure about the construction of bamlist.tsv in that case. Can you please help?

@Congm12
Copy link
Contributor

Congm12 commented Jan 2, 2025

We haven't tested on Slide-seqV2 dataset on our side. But as the read/UMI counts per spot are very low in Slide-seqV2, I would suggest (1) try running the preprocessing Snakemake pipeline and evaluate whether there are enough het SNPs with relative high counts, and if so (2) creating "pseudo spots" to enlarge the count per "spot" by aggregating nearby spots into non-overlapping windows, and run CalicoST on the pseudo spots.

For (1), you can organize the BAM file and barcodes similar to a spaceranger output. For each sample, the directory should contain

  • sample_1/
    • possorted_genome_bam.bam
    • possorted_genome_bam.bam.bai
    • filtered_feature_bc_matrix/
      • barcodes.tsv.gz
  • sample_2/
    • possorted_genome_bam.bam
    • possorted_genome_bam.bam.bai
    • filtered_feature_bc_matrix/
      • barcodes.tsv.gz
  • ...

With only one Slide-seqV2 sample, update in the spaceranger_dir field to the constructed directory in the config.yaml file (https://github.com/raphael-group/CalicoST/blob/main/config.yaml). With multiple Slide-seqV2 samples, replace the spaceranger_dir line with bamlist: <a bamlist.tsv file to be constructed> in config.yaml file, where the bamlist.tsv file is a tab-delimited table with three columns (path_to_bamfile, sample_id, path_to_spaceranger_directories) and without header. Additionally, Slide-seqV2 may have a different cell barcode tag and UMI tag in the bam file, please update the following two liens in config.yaml file accordingly.

UMItag: "Auto"
cellTAG: "CB"

Once the config.yaml file is updated and the paths to reference files are filled in, run the snakemake pipeline by

snakemake --cores <number threads> --configfile config.yaml --snakefile calicost.smk --keep-incomplete all

If the preprocessing is successful, cell_snp_Aallele.npz and cell_snp_Ballele.npz for A and B allele counts per het SNP per spot will be generated. You can examine the number of het SNPs and the het SNPs with reasonable read counts by loading them in python

import numpy as np
import scipy

# load sparse matrices
A = scipy.sparse.load_npz('cell_snp_Aallele.npz')
B = scipy.sparse.load_npz('cell_snp_Ballele.npz')

# print out number of het SNPs
print(A.shape)

# print out the number of het SNPs with reasonable read count
totalcount_per_snp = A.sum(axis=0).A.flatten() + B.sum(axis=0).A.flatten()
print(np.sum(totalcount_per_snp > 5))
print(np.sum(totalcount_per_snp > 10))

If the number of het SNPs with reasonable read count is too low (<10000), the data is too sparse to use the B allele frequency signal, and it's unlikely to get meaningful allele-specific CNAs from CalicoST.

But if there are enough het SNPs with reasonable read count, it's worth trying aggregating spots into "pseudo spots", with the goal that each "pseudo spot" have more than 200 SNP-covering UMI counts on average (the sum across axis=1 of the above A and B allele counts corresponding to SNP-covering UMI per spot). Then, create spaceranger directories according to the pseudo spots

  • sample_1/
    • filtered_feature_bc_matrix.h5ad
    • spatial/
      • tissue_positions.csv
  • ...

Update snpinfo/cell_snp_Aallele.npz, snpinfo/cell_snp_Ballele.npz and snpinfo/barcodes.txt according to the pseudo spots. And finally run CalicoST on the pseudo spot level.

Best,
Cong

@kzb193
Copy link
Author

kzb193 commented Jan 3, 2025

These details are extremely helpful. Thank you very much Dr. Ma.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants