Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Max Number of samples for PC-AIR #64

Closed
jjfarrell opened this issue Mar 30, 2021 · 13 comments
Closed

Max Number of samples for PC-AIR #64

jjfarrell opened this issue Mar 30, 2021 · 13 comments

Comments

@jjfarrell
Copy link

Is there a max number of samples PC-AIR will work on? We find when the number of samples increases from 45K to 55K, we are getting NA results. With samples less than 45K, it runs fine. Any suggestions or workarounds?

@mconomos
Copy link
Contributor

Hello. I suspect that you are using an old version of GENESIS. The pcair function used to call other R functions that limited the sample size to ~46K samples, but that was updated and resolved to allow for larger samples over a year ago. We've now successfully run PC-AiR on samples over 150K. Please try updating to the latest version.

If you update to the latest version (or are already using it) and still have an issue, please let me know.

@jjfarrell
Copy link
Author

Should GENESIS_2.20.1 be able to handle the larger sample size? We are getting segmentation faults with that version. How much memory per core is needed?

@mconomos
Copy link
Contributor

mconomos commented Apr 9, 2021

Hi. Yes, v2.20.1 has the update to allow for larger sample sizes. Also, if you were running into that older issue, you wouldn't be getting segmentation faults; rather you would be getting a "cholmod" error in R.

My best guess is that you need more memory for your analysis, but it's hard to say without more information about what you're seeing. For example, do you know at which stage in the analysis it is crashing? is it in the sample partitioning step, or the pca step, or the projection step, etc.? If you can provide any logs of your jobs, then I might be able to help further.

There's no straightforward answer to how much memory is needed, because it's going to depend on sample size, the number of variants, and the file format of your kinobj and divobj.

If you can provide more specific information, let me know, and I'd be happy to try to help diagnose further.

@jjfarrell
Copy link
Author

@mconomos Thanks for looking into this! Below is the log and the error message from our run with 2.21.4 version.

Using kinobj and divobj to partition samples into unrelated and related sets
Working with 52877 samples
Identifying relatives for each sample using kinship threshold 0.0220970869120796
Identifying pairs of divergent samples using divergence threshold -0.0220970869120796
Partitioning samples into unrelated and related sets...
    ...1000 samples added to related.set...
    ...2000 samples added to related.set...
    ...3000 samples added to related.set...
Unrelated Set: 49156 Samples
Related Set: 3721 Samples
Performing PCA on the Unrelated Set...
Principal Component Analysis (PCA) on genotypes:
Excluding 0 SNP on non-autosomes
Excluding 0 SNP (monomorphic: TRUE, MAF: NaN, missing rate: NaN)
    # of samples: 49,156
    # of SNPs: 87,468
    using 28 threads
    # of principal components: 32
PCA:    the sum of all selected genotypes (0,1,2) = 1857235516
CPU capabilities: Double-Precision SSE2
Sat Apr 10 08:01:52 2021    (internal increment: 64)
[==================================================] 100%, completed, 45.5m
Sat Apr 10 08:47:24 2021    Begin (eigenvalues and eigenvectors)

 *** caught segfault ***
address 0x13089c0, cause 'memory not mapped'

Traceback:
 1: snpgdsPCA(gdsobj, sample.id = part$unrels, snp.id = snp.include,     num.thread = num.cores, verbose = verbose, ...)
 2: withCallingHandlers(expr, message = function(c) if (inherits(c,     classes)) tryInvokeRestart("muffleMessage"))
 3: suppressMessages(snpgdsPCA(gdsobj, sample.id = part$unrels, snp.id = snp.include,     num.thread = num.cores, verbose = verbose, ...))
 4: .pcair(gdsobj, ...)
 5: pcair(gdsobj = gds_obj, num.cores = 28, kinobj = kingMat, divobj = kingMat)
 6: pcair(gdsobj = gds_obj, num.cores = 28, kinobj = kingMat, divobj = kingMat)
An irrecoverable exception occurred. R is aborting now ...
/var/spool/sge/scc-zm4/job_scripts/5828043: line 6: 173064 Segmentation fault      Rscript --vanilla $1

@smgogarten
Copy link
Collaborator

The error is actually coming from a function in the SNPRelate package. I recommend running pcairPartition to find the unrelated set, then calling snpgdsPCA directly using the unrelated set as the sample.id argument (this is what pcair does internally). Then report the error, along with the command you ran and the output of sessionInfo(), on the SNPRelate GitHub page.

@jjfarrell
Copy link
Author

@smgogarten This was run with the pcairPartition step. However, this time there was no error but the PCs all were NAs. Isn't this pointing to a PC-AIR issue rather than SNPRelate now?

PC1     PC2     PC3     PC4     PC5     PC6     PC7     PC8     PC9     PC10    sample.id
-0.0120138984629118     -0.00155873339950702    -0.00110230516591038    -9.77996252854696e-05   0.000395412453209808    -0.00152135811291193    -1.45423707942342e-05   -0.00201312188579735    -0.00394751800943769    -0.00282981157052519    08AD09144_NACC022453
NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      08AD8245_NACC657970
NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      08AD9080_NACC594499
NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      08AD9797
NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      08AD10457
NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      08AD10987
NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      08AD11073
NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      08AD11218
NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      08AD11219

Here is the code...

kingMat <- readRDS("adgc.kingMat.rds")

sampleset <- pcairPartition(kinobj =kingMat,divobj =kingMat)
length(sampleset$rels)
length(sampleset$unrels)

#pca on unrels
pca.unrel <- snpgdsPCA(gds_obj, sample.id=sampleset$unrels,num.thread = coreNum)
saveRDS(pca.unrel, "pca.unrel.rds")
# project values for relatives
snp.load  <- snpgdsPCASNPLoading(pca.unrel, gdsobj=gds_obj,num.thread = coreNum)
samp.load <- snpgdsPCASampLoading(snp.load, gdsobj=gds_obj, sample.id=sampleset$rels,num.thread = coreNum)

Here is the session info:

R version 4.0.0 (2020-04-24)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS:   /share/pkg.7/r/4.0.0/install/lib/libopenblasp-r0.3.7.so
LAPACK: /share/pkg.7/r/4.0.0/install/lib/liblapack.so.3.9.0

locale:
[1] C

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods  
[8] base    

other attached packages:
[1] GENESIS_2.21.4      GWASTools_1.36.0    Biobase_2.47.3    
[4] BiocGenerics_0.33.3 SNPRelate_1.24.0    gdsfmt_1.23.13    

loaded via a namespace (and not attached):
 [1] zoo_1.8-8              tidyselect_1.1.0       purrr_0.3.4          
 [4] DNAcopy_1.61.0         splines_4.0.0          lattice_0.20-41      
 [7] vctrs_0.3.0            generics_0.0.2         stats4_4.0.0          
[10] GWASExactHW_1.01       mgcv_1.8-31            blob_1.2.1            
[13] survival_3.1-12        rlang_0.4.6            pillar_1.4.4          
[16] glue_1.4.1             DBI_1.1.0              bit64_0.9-7          
[19] GenomeInfoDbData_1.2.3 foreach_1.5.0          lifecycle_0.2.0      
[22] zlibbioc_1.33.1        Biostrings_2.55.7      MatrixModels_0.4-1    
[25] codetools_0.2-16       memoise_1.1.0          SeqArray_1.27.18      
[28] IRanges_2.21.8         SparseM_1.78           GenomeInfoDb_1.23.17  
[31] lmtest_0.9-37          quantreg_5.55          broom_0.5.6          
[34] Rcpp_1.0.4.6           backports_1.1.7        quantsmooth_1.53.0    
[37] S4Vectors_0.25.15      XVector_0.27.2         bit_1.1-15.2          
[40] digest_0.6.25          dplyr_1.0.0            GenomicRanges_1.39.3  
[43] grid_4.0.0             bitops_1.0-6           sandwich_2.5-1        
[46] magrittr_1.5           RCurl_1.98-1.2         tibble_3.0.1          
[49] RSQLite_2.2.0          mice_3.9.0             crayon_1.3.4          
[52] tidyr_1.1.0            pkgconfig_2.0.3        ellipsis_0.3.1        
[55] Matrix_1.2-18          data.table_1.14.0      logistf_1.23          
[58] iterators_1.0.12       SeqVarTools_1.25.2     R6_2.4.1              
[61] nlme_3.1-148           compiler_4.0.0   

Here is the log, this time the script run through but the PCs results are NaNs:
/restricted/projectnb/adgc/zhucc/ADGC_data/pheno/PCs/script_53k_twoStepsPCS/adgc.pc-air.pcs.txt

Working with 52877 samples
Identifying relatives for each sample using kinship threshold 0.0220970869120796
Identifying pairs of divergent samples using divergence threshold -0.0220970869120796
Partitioning samples into unrelated and related sets...
    ...1000 samples added to related.set...
    ...2000 samples added to related.set...
    ...3000 samples added to related.set...
[1] 3721
[1] 49156
Principal Component Analysis (PCA) on genotypes:
Excluding 0 SNP on non-autosomes
Excluding 0 SNP (monomorphic: TRUE, MAF: NaN, missing rate: NaN)
    # of samples: 49,156
    # of SNPs: 87,468
    using 28 threads
    # of principal components: 32
PCA:    the sum of all selected genotypes (0,1,2) = 1857235516
CPU capabilities: Double-Precision SSE2
Fri Apr 16 22:45:20 2021    (internal increment: 64)
[==================================================] 100%, completed, 43.8m
Fri Apr 16 23:29:14 2021    Begin (eigenvalues and eigenvectors)
Sat Apr 17 10:09:42 2021    Done.
SNP Loading:
    # of samples: 49,156
    # of SNPs: 87,468
    using 28 threads
    using the top 32 eigenvectors
SNP Loading:    the sum of all selected genotypes (0,1,2) = 1857235516
Sat Apr 17 10:09:50 2021    (internal increment: 444)
[==================================================] 100%, completed, 7s  
Sat Apr 17 10:09:57 2021    Done.
Sample Loading:
    # of samples: 3,721
    # of SNPs: 87,468
    using 28 threads
    using the top 32 eigenvectors
Sample Loading:    the sum of all selected genotypes (0,1,2) = 140973579
Sat Apr 17 10:10:01 2021    (internal increment: 5908)
[==================================================] 100%, completed, 4s  
Sat Apr 17 10:10:05 2021    Done.

@smgogarten
Copy link
Collaborator

From your sessionInfo I see that although you have current versions of GENESIS and SNPRelate, your version of gdsfmt is out of date: current is 1.26.1 and you have 1.23.13. This mismatch might be causing the problem, as SNPRelate depends closely on gdsfmt and they are often updated in tandem. Using BiocManager::install to install Bioconductor packages in a current R version is recommended, as it ensures that the versions of all installed packages were tested together.

@jjfarrell
Copy link
Author

@smgogarten We reinstalled with BiocMManager the 2.21.4 version which updated the gdsfmt version. We then reran genesis on the 53k samples. The current 1.26.1 gdsfmt did not resolve the issue. The PCs were all NAs as before.

log info:
R version 4.0.3 (2020-10-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS:   /share/pkg.7/r/4.0.3/install/lib/libopenblasp-r0.3.7.so
LAPACK: /share/pkg.7/r/4.0.3/install/lib/liblapack.so.3.9.0

locale:
[1] C

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods  
[8] base    

other attached packages:
[1] GENESIS_2.20.1      GWASTools_1.36.0    Biobase_2.50.0    
[4] BiocGenerics_0.36.1 SNPRelate_1.24.0    gdsfmt_1.26.1      

loaded via a namespace (and not attached):
 [1] quantsmooth_1.56.0     Rcpp_1.0.6             lattice_0.20-41      
 [4] tidyr_1.1.2            Biostrings_2.58.0      formula.tools_1.7.1  
 [7] zoo_1.8-8              digest_0.6.27          lmtest_0.9-38        
[10] foreach_1.5.1          R6_2.5.0               GenomeInfoDb_1.26.1  
[13] backports_1.2.1        MatrixModels_0.5-0     stats4_4.0.3          
[16] RSQLite_2.2.1          pillar_1.4.7           zlibbioc_1.36.0      
[19] rlang_0.4.9            rstudioapi_0.13        data.table_1.14.0    
[22] SparseM_1.81           blob_1.2.1             S4Vectors_0.28.1      
[25] Matrix_1.2-18          splines_4.0.3          GWASExactHW_1.01      
[28] RCurl_1.98-1.3         bit_4.0.4              broom_0.7.2          
[31] compiler_4.0.3         pkgconfig_2.0.3        mgcv_1.8-33          
[34] tidyselect_1.1.0       tibble_3.0.4           GenomeInfoDbData_1.2.4
[37] DNAcopy_1.64.0         IRanges_2.24.0         codetools_0.2-18      
[40] matrixStats_0.57.0     crayon_1.4.1           dplyr_1.0.2          
[43] conquer_1.0.2          bitops_1.0-6           grid_4.0.3            
[46] nlme_3.1-150           lifecycle_0.2.0        DBI_1.1.1            
[49] magrittr_2.0.1         SeqVarTools_1.28.1     XVector_0.30.0        
[52] mice_3.12.0            ellipsis_0.3.1         generics_0.1.0        
[55] vctrs_0.3.5            sandwich_3.0-0         tools_4.0.3          
[58] iterators_1.0.13       bit64_4.0.5            glue_1.4.2            
[61] purrr_0.3.4            survival_3.2-7         SeqArray_1.30.0      
[64] GenomicRanges_1.42.0   operator.tools_1.6.3   memoise_1.1.0        
[67] logistf_1.24           quantreg_5.75        
Using kinobj and divobj to partition samples into unrelated and related sets
Working with 52877 samples
Identifying relatives for each sample using kinship threshold 0.0220970869120796
Identifying pairs of divergent samples using divergence threshold -0.0220970869120796
Partitioning samples into unrelated and related sets...
    ...1000 samples added to related.set...
    ...2000 samples added to related.set...
    ...3000 samples added to related.set...
[1] 3721
[1] 49156
Principal Component Analysis (PCA) on genotypes:
Excluding 0 SNP on non-autosomes
Excluding 0 SNP (monomorphic: TRUE, MAF: NaN, missing rate: NaN)
    # of samples: 49,156
    # of SNPs: 87,468
    using 28 threads
    # of principal components: 32
PCA:    the sum of all selected genotypes (0,1,2) = 1857235516
CPU capabilities: Double-Precision SSE2
Thu Apr 22 17:04:43 2021    (internal increment: 92)
[==================================================] 100%, completed, 44.4m
Thu Apr 22 17:49:15 2021    Begin (eigenvalues and eigenvectors)
Fri Apr 23 11:09:15 2021    Done.
SNP Loading:
    # of samples: 49,156
    # of SNPs: 87,468
    using 28 threads
    using the top 32 eigenvectors
SNP Loading:    the sum of all selected genotypes (0,1,2) = 1857235516
Fri Apr 23 11:09:22 2021    (internal increment: 740)
[==================================================] 100%, completed, 7s  
Fri Apr 23 11:09:29 2021    Done.
Sample Loading:
    # of samples: 3,721
    # of SNPs: 87,468
    using 28 threads
    using the top 32 eigenvectors
Sample Loading:    the sum of all selected genotypes (0,1,2) = 140973579
Fri Apr 23 11:09:34 2021    (internal increment: 9780)
[==================================================] 100%, completed, 4s  
Fri Apr 23 11:09:38 2021    Done. 

@smgogarten
Copy link
Collaborator

I wish I could be more help, but I think something is wrong with your input data. I just ran this sequence of functions (with the same package versions) on a dataset with 49846 related and 14555 unrelated samples, and got expected results for all. Do you have missing genotypes in your data? Have you successfully run other SNPRelate functions on the same GDS file and gotten expected results? Could the number of cores you requested be more than SNPRelate can handle? (The max I've tried is 16.)

The other thing that is different between my dataset and yours is that I'm using SeqArray formatted GDS files, and you appear to be using the older array-based GDS format (since SeqArray isn't listed in your sessionInfo). As far as I know that shouldn't make a difference, but it's the only other thing I can think of.

Regardless of which of these things is the problem, you still might have better luck asking directly on the SNPRelate page, since the NAs are being produced by a SNPRelate function.

@jjfarrell
Copy link
Author

Thanks for looking into this. I submitted an issue with SNPRelate to see if they have any suggestions. Is your script reading as input a VCF or plink file? We are using a plink file as input.

@smgogarten
Copy link
Collaborator

@jjfarrell it looks like you submitted the issue on the SeqArray page instead of the SNPRelate page. If you're converting plink to GDS with SNPRelate::snpgdsBED2GDS, it's definitely a SNPRelate issue.

We use VCF files as input, but since the conversion to GDS takes some time we do it only once for each dataset, then store the GDS file and use it as input to any subsequent scripts. For debugging, I would recommend not just running the entire script repeatedly, but breaking it down into its component steps and examining the output after each step.

@jjfarrell
Copy link
Author

jjfarrell commented Apr 30, 2021

What is num.cores set to and how much memory does the compute node have for your test run? Our Job has 28 cores with 196 GB.

@smgogarten
Copy link
Collaborator

Solution as described in SNPRelate #86: use the argument algorithm="randomized" in pcair.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants