Max Number of samples for PC-AIR #64
Comments
Hello. I suspect that you are using an old version of GENESIS. The pcair function used to call other R functions that limited the sample size to ~46K samples, but that was updated and resolved to allow for larger samples over a year ago. We've now successfully run PC-AiR on samples over 150K. Please try updating to the latest version. If you update to the latest version (or are already using it) and still have an issue, please let me know. |
Should GENESIS_2.20.1 be able to handle the larger sample size? We are getting segmentation faults with that version. How much memory per core is needed? |
Hi. Yes, v2.20.1 has the update to allow for larger sample sizes. Also, if you were running into that older issue, you wouldn't be getting segmentation faults; rather, you would be getting a "cholmod" error in R. My best guess is that you need more memory for your analysis, but it's hard to say without more information about what you're seeing. For example, do you know at which stage of the analysis it is crashing? Is it in the sample partitioning step, the PCA step, the projection step, etc.? If you can provide any logs of your jobs, then I might be able to help further. There's no straightforward answer to how much memory is needed, because it depends on the sample size, the number of variants, and the file format of your [...] If you can provide more specific information, let me know, and I'd be happy to try to help diagnose further. |
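For a rough sense of scale, here is a back-of-the-envelope sketch in R. It assumes that some step holds a dense n-by-n double-precision matrix in memory; that assumption is not confirmed anywhere in this thread, and `dense_matrix_gib` is a made-up helper name, not part of GENESIS or SNPRelate. Actual usage also depends on the number of variants, the file format, and the number of cores.

```r
# Hypothetical helper (not part of any package): memory in GiB needed for
# one dense n x n matrix of doubles (8 bytes per value).
dense_matrix_gib <- function(n, bytes_per_value = 8) {
  n^2 * bytes_per_value / 1024^3
}

# Sample sizes mentioned in this thread.
for (n in c(45000, 55000, 150000)) {
  cat(sprintf("n = %6d: %.1f GiB\n", n, dense_matrix_gib(n)))
}
```

Any working copies made during the computation multiply this figure, so it is best read as a lower bound under the stated assumption.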
@mconomos Thanks for looking into this! Below are the log and the error message from our run with version 2.21.4.
|
The error is actually coming from a function in the SNPRelate package. I recommend running |
@smgogarten This was run with the pcairPartition step. However, this time there was no error, but the PCs were all NAs. Doesn't this point to a PC-AiR issue rather than SNPRelate now?
Here is the code...
Here is the session info:
Here is the log. This time the script ran through, but the PC results are NaNs:
|
From your |
@smgogarten We reinstalled version 2.21.4 with BiocManager, which also updated gdsfmt. We then reran GENESIS on the 53K samples. The current gdsfmt (1.26.1) did not resolve the issue. The PCs were all NAs as before.
|
I wish I could be more help, but I think something is wrong with your input data. I just ran this sequence of functions (with the same package versions) on a dataset with 49846 related and 14555 unrelated samples, and got expected results for all. Do you have missing genotypes in your data? Have you successfully run other SNPRelate functions on the same GDS file and gotten expected results? Could the number of cores you requested be more than SNPRelate can handle? (The max I've tried is 16.) The other thing that is different between my dataset and yours is that I'm using SeqArray formatted GDS files, and you appear to be using the older array-based GDS format (since SeqArray isn't listed in your sessionInfo). As far as I know that shouldn't make a difference, but it's the only other thing I can think of. Regardless of which of these things is the problem, you still might have better luck asking directly on the SNPRelate page, since the NAs are being produced by a SNPRelate function. |
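As a minimal sketch of the missing-genotype check asked about above, the base R snippet below computes missing-call rates on a toy matrix. The layout (samples in rows, variants in columns, `NA` for a missing call) and the 0.3 threshold are assumptions chosen for illustration; real data would first have to be read out of the GDS file.

```r
# Toy genotype matrix (made-up data): rows are samples, columns are variants,
# values are allele counts (0/1/2), with NA marking a missing call.
geno <- matrix(c(0,  1,  2, NA,
                 1, NA, NA,  2,
                 2,  2,  0,  1),
               nrow = 3, byrow = TRUE,
               dimnames = list(paste0("sample", 1:3), paste0("snp", 1:4)))

# Fraction of missing calls per sample and per variant.
sample_missing  <- rowMeans(is.na(geno))
variant_missing <- colMeans(is.na(geno))

# Flag samples above an arbitrary illustrative threshold.
high_missing <- names(sample_missing)[sample_missing > 0.3]
print(sample_missing)
print(high_missing)
```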
Thanks for looking into this. I submitted an issue with SNPRelate to see if they have any suggestions. Is your script reading as input a VCF or plink file? We are using a plink file as input. |
@jjfarrell it looks like you submitted the issue on the SeqArray page instead of the SNPRelate page. If you're converting plink to GDS with SNPRelate::snpgdsBED2GDS, it's definitely a SNPRelate issue. We use VCF files as input, but since the conversion to GDS takes some time we do it only once for each dataset, then store the GDS file and use it as input to any subsequent scripts. For debugging, I would recommend not just running the entire script repeatedly, but breaking it down into its component steps and examining the output after each step. |
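One way to examine the output after each step is a small guard that stops at the first intermediate result containing NA or NaN. This is only a sketch in base R: `check_step` is a hypothetical helper, and the toy matrices stand in for whatever each step of the real script returns.

```r
# Hypothetical helper: report the dimensions of an intermediate result and
# stop if it contains any NA/NaN values.
check_step <- function(x, label) {
  x <- as.matrix(x)
  n_bad <- sum(is.na(x))  # is.na() is TRUE for both NA and NaN
  message(sprintf("%s: %d x %d, %d NA/NaN values",
                  label, nrow(x), ncol(x), n_bad))
  if (n_bad > 0) {
    stop(label, " contains NA/NaN; inspect this step before continuing")
  }
  invisible(x)
}

# Toy stand-ins for two intermediate outputs (made-up numbers).
ok  <- matrix(1:6 / 10, nrow = 3)
bad <- ok
bad[2, 1] <- NaN

check_step(ok, "step 1 (toy output)")
result <- tryCatch(check_step(bad, "step 2 (toy output)"),
                   error = function(e) conditionMessage(e))
print(result)
```

Saving each intermediate object (for example with `saveRDS`) before moving on makes it possible to rerun only the failing step instead of the whole script.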
What is num.cores set to, and how much memory does the compute node have for your test run? Our job has 28 cores with 196 GB. |
Solution as described in SNPRelate #86: use the argument |
Is there a maximum number of samples that PC-AiR will work on? We find that when the number of samples increases from 45K to 55K, we get NA results. With fewer than 45K samples, it runs fine. Any suggestions or workarounds?