Using fix_targetfile messes up with the target file #115

Open
josh3397 opened this issue Apr 14, 2023 · 3 comments

Comments

@josh3397

Hi,

I've been reading the section on improving the target file by using the fix_targetfile option to remove low-complexity regions. I noticed that using this option corrupts the FASTA headers in my target protein dataset. Has anybody else faced this issue?

@chrisjackson-pellicle
Collaborator

Hi @josh3397,

Can you provide more details on this, please? An example of your input, and the incorrect output?

Cheers,

Chris

@josh3397
Author

Hello,

My input protein targets look like this -

Bombyx_mori-BGIBMGA010725
VLASLHNGVIQLWDYRMCTLLEKFDEHDGPVRGICFHIQQPLFVSGGDDYKIKVWNYKQRRCLFTLLGHLDYIRTTFFHHEYPWILSASDDQTIRIWNWQSRQCISVLTGHNHYVMCAQFHPTEDLLVSASLDQSVRVWDFSGLRKKSVAPGPAGLTEHLRNPQATDLFG-----QADAVVKHVLEGHDRGVNWACFHPTLPLIASAADDRQVKLWRMNDSKAWEVDTCRGHYNNVSCVLFHAKHELIISNSEDLYIRVWDMSKRTLLQSFRREHERH----------------------------------------------LRKLDMLTNRDAPVMHLSKGGGRQQPYS------------------------------------MSLNHAEWCVLVTWRNG---ENCSYELYAAPRD--HSGAVPEGTEPARGQATTAIW--------------LVIRNLKNEVSKKISTPTCEEIMYAGTGMLLLREADSVQLLDVQQKRTIASVKVSKCRYAIWNTDMSIVALLGKHTVTLCTKKLEQLCCITEGARVKSGAFDDSTPQPVFIYTTANHIKYCCKDGDYGIIRTLDVPVYAVRVLS-----TETGARVVCLDRECRPKVLNIDPTEY-RFKLALVTRQYDQVLHMVRTAKLVGQSIIAYLQEKGYPEVALHFVKDSRTRLSLALQCGNIEVALEAAKSLDEPAAWDQLAKAALTTGNHQIVEMCYQRTKNFDKLSFLYLVTGNLEKLRKMMKIAEIRKDTSAQFQGALLLGDVGERIRLLRNSGQLSLAYLTAINHKQVEEAEQLKAALEAAGMPVPEANPDAVFLRPPLPIQRNQPNWPLLAVSKSFFEVAGQARAAAEGSGSAVAAA-LDEPLEAAGAWGDDDVL----PDHKEEGEEEIMEDACEDGGWDVGDEDLELPEELAPVSADMGAAEDSEQYFVAPTRGASAPLAARLRTAHDHVATGQFEAAMRLLNEQVGIVNFAPYESVFAEMFAHARVTFGALPSLPALTAYLHRNWKEATGKDLLPVITLKLSDLVSQLQQSYQLTTAGRFPEAIERLQGVAQRVPLLLVDSKAELSEAQQLLAVCRDYLVGLAMETARKAMPKNTVDEQKRTCEMAAYFTHCKLQPVHQILTLRTALNMFFKLKNYRTAASFARRLLELGPRPEVAQQARKILQACEKTPTDEHQLLYDEHNPFSVCGISYKPIYRGKPEEKCSLCAASFMPEHKGKLCPVCGVAEIGKDALGLRICPLQFNR
--
Bombyx_mori-BGIBMGA000829
-------------------------------------MYLYNLTLQGSSAITHAVHGNFSGTKQQEIIISRGKTLELLRPDPNTGKVHTFMKIEIFGVVRSIMAFRLTGGTKDYIVVGSDSGRIIILEYIPTKNILEKVHQETFGKSGCRRIVPGQYLSIDPKGRAVMIGAIEKQKLVYILNRDAEARLTISSPLEAHKSNTLVYHMVGVDVGFENPMFACLEIDYEEADSDPTGEAAQKTQQTLTFYELDLGLNHVVRKYSEPLEEHANFLITVPGGNDGPSGVLICSENYLTYKNLGDQHDIRCPIPRRRNDLDDPERGMIFVCSATHKTKSMFFFLAQTEQGDIFKITIETDEDMVTEIKLKYFDTVPVATSMCVLKTGFLFVACEFGNHYLYQIAHLGDEDDEPEFSSAMPLEEGDTFFFAPRPLRNLVLVDEMDSLSPILACHVADLAGEDTPQVYLACGRGPRSSLRALRHGLEVAEMAVSELPGSPNAVWTVRRNKDEEYDSYIIVSFVNATLVLSIGETVEEVTDSGFLGTTPTLSCHAMGNDALVQVYPDGIRHIRADKRVNEWKAPGKKSIVRCAVNQRQVVIALTGGELVYFEMDPTGQLNEYTERKKLSSDVCCMALGSVAAGEQRAWFLAVGLNDNTVRIISLDPADCLSPRSMQALPAGAESLCIIEQPFESGAKSALHLNIGLSNGVLLRTTLDSVSGDLADTRTRYLGSRPVKLFKVRVQAAEAVLAVSSRTWLGYHYQNRFHLTPLSYECLEYAAGFSSEQCTEGIVAISSNTLRILALEKLGAVFNQTFVPLEYTPRKFIINSDNNHIIVLETDHNAYTEEMKKH-RRIQMAQEMREAAA-GGAPEEQQLANEMADAFLSDTLPEYIFSSPKAGAGMWASLIRVVDMGIGGG--QPNTLFRL-PLEQNEAAVSLCIVRWAAHAEHAQPHLVVGVAKDL-ILSPRSCTEGSLHVYKIYGNTGKLELVHKTPVDEYPGAIAAFNGRLLAGVGRMLRLYDIGRRKLLRKCENRHIPNLIADIKTIGQRIFVSDVQESVFCVKHKKRENQLIIFADDTNPRWITNSCILDYDTIAVSDKFGNVAIMRLPQSVSDDVDEDPTGNKALWDRGLLNGASQKGDVVVNFHVGETVTSLQRATLIPGGSEALLYATISGSLGVLLPFTSREDHDFFQHLEMHMRSENSPLCGRDHLSFRSYYYPVKNVIDGDLCEQFNSLDPGKQKAIAGDLERTPAEVSKKLEDIRTRYAF
--

After using fix_targetfile to remove low-complexity regions, I get output in which the FASTA headers of many sequences are corrupted, like this:

Heliconius_melpomene-HMEL007836
ADLVVIGSGPGGYVAAIKAAQLGLKTISVEKDPSLGGTCLNVGCIPSKALLHNSHLY>He
liconius_melpomene-HMAKHDFKHRGIDVGEVKFDFDAMMAYKSNAVKGLTGGIAM
LFNKNKVQLVRGVGSVVAPNKVEVQGEKG-VETINTKNIIIASGSEVTPFPGVTFDEQQI
ITSTGALSLSKVPKKMLVIGAGVIGLELGSVYQRLGADVTAIEFLESIGGIGIDGEVSKT
LQKILTKQGMKFKLGTKVTAVKKEGGVVKIEVEAAKGGNKETLDCDVVLISIGRRPYTKG
LGLEKVGIALDDRGRIPVNNKFQTTIPGIYAIGDVIHGPMLAHKAEDEGIVCVEGIKGMP
VHFNYDAIPSVIYTSPEVGWVGKSEEDLKK-EGRAYKVGKFPFMANSRAKTNGEPEGFVK

Heliconius_melpomene-HMEL010336
CPYLDTINRHVLDFDFEKLCSISLTRINVYACLVCGKYFQGRGTNTHAYTHSVADGHHVF
LNLHTLKFYCLPDNYEVIDSSLNDIKYVLNPIFTPEQIKQLDENTKMSRAIDGTMYMPGI
VGLNNIKANDYCNVILQCLSQVRPLRNYFLREENYADVKRPPGDSSFLLVQRFGELIRKL
WNPRAFKAHVSPHEMLQAVVLWSKKRFQFIKQSDPIDFLSWFLNSLHLALNGTKKP-NSS
IIYKSFLG>Heliconius_melpomene-HMRIYTRKLPPPDADDAAKVDLSSEEYNEM
ITESPFLYLTCDLPPTPLFTDEFRENIIPQVNLYQLLSKFNGQTSKEYKTYKENFMKRFE
ITQLPPYLILYIKRFTKNTFFVEKNPTVVNFPVKNVDFGDILTPEVKAKHNGKTTYELVG
NIVHDGTPEKGTYRAHVLHTPTQQWYEMQDLHVTSILPQMITLTEAYIQIYELKQD--

Regards,
Mukta
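The corruption in the output above shows up as a `>` character in the middle of a sequence line, with the rest of the header wrapped onto the next line. A minimal sketch for locating such lines is shown here. The function name `find_embedded_headers` is hypothetical, and the sample assumes that real headers in the file start with `>` (the marker is not visible in the pasted text). This is a diagnostic aid only, not part of HybPiper.

```python
def find_embedded_headers(lines):
    """Return (line_number, text) for every line where '>' appears after column 0.

    In a well-formed FASTA file '>' only occurs as the first character of a
    header line, so any other position suggests a header was merged into
    sequence data.
    """
    hits = []
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\n")
        # Skip the first character so genuine header lines are not reported.
        if ">" in text[1:]:
            hits.append((number, text))
    return hits


# Sample lines taken from the output pasted above; the leading '>' on the
# first header is an assumption about the real file.
sample = [
    ">Heliconius_melpomene-HMEL007836",
    "ADLVVIGSGPGGYVAAIKAAQLGLKTISVEKDPSLGGTCLNVGCIPSKALLHNSHLY>He",
    "liconius_melpomene-HMAKHDFKHRGIDVGEVKFDFDAMMAYKSNAVKGLTGGIAM",
]

for number, text in find_embedded_headers(sample):
    print(f"line {number}: {text}")
```

To scan a real file, pass an open file handle instead of the sample list, since iterating over a file yields its lines.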

@chrisjackson-pellicle
Collaborator

Hi Mukta,

It looks like your target file contains alignments, rather than unaligned sequences?

Can you upload your target file, and paste the hybpiper check_targetfile and hybpiper fix_targetfile commands you used?

Cheers,

Chris
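If the target file does turn out to contain aligned sequences, one possible preprocessing step is to strip the gap characters before running the target file checks. The sketch below is an assumption about how that could be done, not a fix confirmed in this thread. The function name `strip_alignment_gaps` is hypothetical, and the code assumes header lines start with `>` and that `-` is the only gap character. Header lines are left untouched because the sequence names themselves contain `-`.

```python
def strip_alignment_gaps(fasta_text, gap_chars="-"):
    """Return FASTA text with gap characters removed from sequence lines.

    Header lines (starting with '>') are copied unchanged. Sequence lines
    that become empty after gap removal are dropped.
    """
    cleaned_lines = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            # Keep headers as-is: names such as 'Bombyx_mori-BGIBMGA000829'
            # legitimately contain '-'.
            cleaned_lines.append(line)
            continue
        cleaned = line
        for gap in gap_chars:
            cleaned = cleaned.replace(gap, "")
        if cleaned:
            cleaned_lines.append(cleaned)
    return "\n".join(cleaned_lines) + "\n"


# Shortened, illustrative record based on the input pasted earlier in the thread.
aligned = ">Bombyx_mori-BGIBMGA000829\n-----MYLYNLTLQ--GSS\n--\n"
unaligned = strip_alignment_gaps(aligned)
print(unaligned, end="")
```

Working on a copy of the target file keeps the original alignment available for comparison.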
