IndexError: list index out of range #358

Open
GACGAMA opened this issue Feb 5, 2025 · 3 comments
GACGAMA commented Feb 5, 2025

Hello all,
Both the tumor and normal PacBio HiFi samples fail in Clair3 with index-out-of-range and memory errors. I did not expect Clair3 to need more than 96GB of RAM for a long-read run, but I'm not sure.

Setup

Operating system: Cluster - CentOS
Clair3 version: v1.0.9
Installation method (Docker, built from source, etc.): Singularity (latest version, v1.0.9)
Type of data: PacBio HiFi, from the article https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02803-x ; tumor sample: https://www.ncbi.nlm.nih.gov/biosample/10102573 , SRX5735702

48 threads available; running the program with 32 threads and 192GB RAM

Any additional context:
Files were mapped using minimap2 with -x map-hifi against a custom-made reference: GRCh38_full_analysis_set_plus_decoy_hla.fa
Both the tumor and normal BAMs seem to show the same pattern, but I can post one read here if necessary.

Command used:

INPUT_DIR="/scratch4/nsobrei2/ggama1/longread_cnv_sv/bam/pacbio_test_paper"
OUTPUT_DIR="/scratch4/nsobrei2/ggama1/longread_cnv_sv/clair3/vcf"      
THREADS="32"
MODEL_NAME="hifi_sequel2"
REF_DIR="/scratch4/nsobrei2/references/ncbi_grch38_cipher"

singularity exec -B ${INPUT_DIR},${OUTPUT_DIR} \
  /scratch4/nsobrei2/singularities/clair3_latest.sif \
  /opt/bin/run_clair3.sh \
  --bam_fn=${INPUT_DIR}/HCC1395BL_TUMOR_BREAST_CANCER_PACBIO.bam \
  --ref_fn=${REF_DIR}/GRCh38_full_analysis_set_plus_decoy_hla.fa \
  --threads=${THREADS} \
  --platform="hifi" \
  --model_path="/opt/models/${MODEL_NAME}" \
  --output=${OUTPUT_DIR}

Error:

Total time elapsed: 1666.97 s
Calling variants ...
Total processed positions in chrX (chunk 31/32) : 1562988
Total time elapsed: 1710.61 s
Calling variants ...
Total processed positions in chrX (chunk 32/32) : 1607233
Total time elapsed: 1713.87 s

real	495m51.744s
user	11216m21.472s
sys	472m31.393s
Traceback (most recent call last):
  File "/opt/bin/scripts/../clair3.py", line 105, in <module>
    main()
  File "/opt/bin/scripts/../clair3.py", line 99, in main
    submodule.main()
  File "/opt/bin/preprocess/SortVcf.py", line 308, in main
    sort_vcf_from(args)
  File "/opt/bin/preprocess/SortVcf.py", line 195, in sort_vcf_from
    ctg_name, pos = columns[0], columns[1]
IndexError: list index out of range

real	495m53.573s
user	11216m22.232s
sys	472m31.687s
INFO:    Cleaning up image...
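
The traceback fails at `columns[0], columns[1]`, which happens when a line yields fewer than two fields after splitting. One possible cause (an assumption, not something confirmed in this thread) is an empty or truncated record in an intermediate VCF, for example when a worker is killed after running out of memory. A minimal Python sketch for locating such lines follows; the helper name `find_short_records` and the whitespace split are illustrative and are not taken from Clair3's code:

```python
def find_short_records(lines, min_fields=2):
    """Return (line_number, text) for non-header lines with too few fields.

    Hypothetical helper: assumes records are whitespace-separated and that
    header lines start with '#', as in a standard VCF.
    """
    bad = []
    for number, line in enumerate(lines, start=1):
        if line.startswith("#"):
            continue  # skip VCF header lines
        columns = line.strip().split()
        if len(columns) < min_fields:
            # Too few fields to read a contig name and a position.
            bad.append((number, line.rstrip("\n")))
    return bad
```

Passing an open file handle of an uncompressed VCF as `lines` lists the records that would trip the same kind of indexing.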

GACGAMA changed the title from "numpy.core._exceptions.MemoryError and IndexError: list index out of range" to "IndexError: list index out of range" on Feb 6, 2025
@zhengzhenxian
Collaborator

@GACGAMA

It seems that too many candidates were identified in your sample. Chromosome X has approximately 156,040,895 base pairs and was split into 32 chunks for parallel processing. In your log, roughly a third of the genomic positions in each chunk (for example, 1,607,233 out of 4,876,277) were classified as candidates, leading to an OOM issue. Could you please check the BAM file in IGV to verify that the reads are aligned properly?
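
The chunk size and candidate fraction quoted above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic using the numbers from the log and the comment; the variable names are illustrative, and equal-sized chunks are assumed:

```python
CHRX_LENGTH = 156_040_895  # chrX length quoted in the comment above
NUM_CHUNKS = 32            # chunks reported in the log
candidates = 1_607_233     # processed positions in chrX chunk 32/32

# Assumes the contig is split into equal-sized chunks.
chunk_size = CHRX_LENGTH // NUM_CHUNKS
fraction = candidates / chunk_size

print(f"chunk size: {chunk_size:,} bp")
print(f"candidate fraction: {fraction:.1%}")
```

Running the same calculation on chunk 31/32 (1,562,988 positions) gives a similar share.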


GACGAMA commented Feb 6, 2025

Hi @zhengzhenxian
I'm new to long-read data processing, which is why I started with a high-quality sample from a public database.
I'm not entirely sure how to assess alignment quality, though. I'm running AligmentQC to verify.
The BAM file looks like this on all chromosomes (indels <10 hidden, no quick consensus in IGV):

Image

With quick consensus, almost no SNPs are found:

Image

@zhengzhenxian
Collaborator

The error rate of the provided BAM file is significantly higher than that of other HiFi datasets.
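
One rough way to put a number on the error rate is to divide each read's edit distance (the `NM` tag) by its number of alignment columns from the CIGAR string. The sketch below works on a single SAM text line; it assumes the aligner wrote `NM` tags, the helper name `alignment_error_rate` is hypothetical, and the result also counts true variants, so it is an upper bound on sequencing error rather than a precise estimate:

```python
import re

# One CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def alignment_error_rate(sam_line):
    """Return NM / (M + I + D + = + X bases) for one SAM record, or None."""
    fields = sam_line.rstrip("\n").split("\t")
    if len(fields) < 11 or fields[5] == "*":
        return None  # malformed or unmapped record
    # Alignment columns: matches/mismatches plus inserted and deleted bases.
    aligned_columns = sum(
        int(length)
        for length, op in CIGAR_OP.findall(fields[5])
        if op in "MID=X"
    )
    edit_distance = None
    for tag in fields[11:]:
        if tag.startswith("NM:i:"):
            edit_distance = int(tag[5:])
            break
    if edit_distance is None or aligned_columns == 0:
        return None
    return edit_distance / aligned_columns
```

Feeding it the lines printed by `samtools view` for a small region and averaging the non-`None` values gives a quick figure to compare between datasets.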

We recommend using the latest HiFi data produced by PacBio if you are testing the performance of Clair3.
