Extract a single chromosome or a region of the chromosome from GBZ #4511

Open
jikhashkya opened this issue Jan 28, 2025 · 2 comments

Comments

@jikhashkya

Hi,

I have a GBZ file from Minigraph-Cactus and I want to extract a single chromosome, or a region of a chromosome, as a way to subset the GBZ graph. Is it possible to do this directly using the GBZ as the input graph? I looked into vg chunk and vg find, but I didn't see any flags that explicitly support extracting a chromosome or a region of a chromosome. The best I could do was run the following command:
vg chunk -x <input>.gbz -S <input>.snarl -p GRCh38#0#chr20:2000000-3000000 -O gfa > subgraph.gfa
I assume this extracts the snarls that overlap the given base-pair range on chr20.
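For readers unfamiliar with the target string passed to -p above, here is a minimal Python sketch of how it breaks down, assuming the path name follows the PanSN convention (sample#haplotype#contig) with an optional :start-end suffix. The names PathRegion and parse_region are hypothetical helpers and not part of vg. The sketch does not interpret whether the coordinates are 0- or 1-based, and it assumes the contig name itself contains no colon.

```python
from typing import NamedTuple, Optional


class PathRegion(NamedTuple):
    """Pieces of a target such as GRCh38#0#chr20:2000000-3000000 (hypothetical type)."""
    sample: str
    haplotype: str
    contig: str
    start: Optional[int]
    end: Optional[int]


def parse_region(target: str) -> PathRegion:
    """Split a PanSN-style path name with an optional ':start-end' suffix."""
    # rpartition returns ('', '', target) when there is no colon at all.
    name, sep, interval = target.rpartition(":")
    if not sep:
        name, interval = target, ""

    fields = name.split("#")
    if len(fields) != 3:
        raise ValueError(f"expected sample#haplotype#contig, got {name!r}")
    sample, haplotype, contig = fields

    if not interval:
        # Whole path requested, no coordinate range.
        return PathRegion(sample, haplotype, contig, None, None)

    start_text, _, end_text = interval.partition("-")
    return PathRegion(sample, haplotype, contig, int(start_text), int(end_text))


if __name__ == "__main__":
    print(parse_region("GRCh38#0#chr20:2000000-3000000"))
    print(parse_region("GRCh38#0#chr20"))
```

The second call shows the whole-path form, which is the one used later in the thread for extracting a full chromosome.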

Any pointers to the proper documentation or tutorials would be greatly appreciated. I looked through the wiki but it didn't necessarily help so I apologize if I missed something obvious.

Thank you.

@xchang1
Contributor

xchang1 commented Jan 29, 2025

Hi,

Your command looks right to me. Is it giving you an unexpected output?

If you want the path and all the nested variants too, then I think your command is right. Using the snarls can be a bit slow though.
If you're only interested in the path and the stuff close to it, you can use --context-steps or --context-length to walk out from the nodes along the path.
If you want the whole chromosome, you can use the --components flag, which will give you the whole connected component.
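The three options above can be laid side by side with a small Python sketch that only builds the vg chunk argument lists and prints them; it does not run vg. It uses only flags mentioned in this thread (-x, -S, -p, -O, --context-steps, --context-length, --components). The function name vg_chunk_args, the file names graph.gbz and graph.snarls, and the context values 10 and 1000 are illustrative assumptions, and combining --components with -p is the form asked about in the thread rather than one confirmed here.

```python
import shlex
from typing import List, Optional


def vg_chunk_args(
    gbz: str,
    path: str,
    snarls: Optional[str] = None,
    context_steps: Optional[int] = None,
    context_length: Optional[int] = None,
    components: bool = False,
    output_format: str = "gfa",
) -> List[str]:
    """Build a `vg chunk` argument list (hypothetical helper; nothing is executed)."""
    args = ["vg", "chunk", "-x", gbz, "-p", path]
    if snarls is not None:
        # Path plus nested variants, via a snarl file.
        args += ["-S", snarls]
    if context_steps is not None:
        # Walk out a number of node steps from the path.
        args += ["--context-steps", str(context_steps)]
    if context_length is not None:
        # Walk out a number of bases from the path.
        args += ["--context-length", str(context_length)]
    if components:
        # Whole connected component, i.e. the whole chromosome.
        args.append("--components")
    args += ["-O", output_format]
    return args


if __name__ == "__main__":
    region = "GRCh38#0#chr20:2000000-3000000"
    variants = [
        vg_chunk_args("graph.gbz", region, snarls="graph.snarls"),
        vg_chunk_args("graph.gbz", region, context_steps=10),      # arbitrary value
        vg_chunk_args("graph.gbz", region, context_length=1000),   # arbitrary value
        vg_chunk_args("graph.gbz", "GRCh38#0#chr20", components=True),
    ]
    for args in variants:
        # shlex.join quotes the '#' characters so the line is safe to paste into a shell.
        print(shlex.join(args), "> subgraph.gfa")
```

Keeping the arguments as a list avoids shell-quoting problems with the # characters in PanSN path names if the list is later handed to subprocess.run.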

@jikhashkya
Author

jikhashkya commented Jan 31, 2025

It was giving the expected output; however, I just wanted to get a region without the snarls. When I tried to run the command with only the -p flag, it told me that -p has to be used together with either -S or -c.
Thank you for the clarification. I think that mostly answers my question. Just to confirm: to extract a single chromosome's subgraph from the graph, can I run something like vg chunk -x <input>.gbz --components -p GRCh38#0#chr20 -O gfa > subgraph.gfa ?

Now, is there a way to filter the reads that are mapped within a certain region? For instance, if I have an alig.gam file that was obtained from giraffe, can I extract only the reads that are mapped between node A and node B of the graph?

And is there a way to chunk the graph so that the haplotype paths are retained in the subgraph?
