This is due to using pre-existing libraries in addition to the de novo pipeline. @jamesdgalbraith we should be able to work something out for this I think?
I don't think this would be possible to implement, as the RepeatMasker GFF is what we use for the coordinates and identity of the repeats, and the divergence scripts were written to be compatible with de novo curated libraries rather than Dfam input.
To achieve this we'd need to fiddle with RepeatMasker's complex process of merging and adjusting coordinates, which I for one haven't been able to understand. Additionally, we wouldn't be able to merge the repeats back together: the divergence is the genetic distance between the repeat's sequence in the genome and the consensus sequence it corresponds to, and in this case there are two or more separate consensus sequences, so from a scientific perspective merging or averaging the distances makes no sense.
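For reference, and assuming the KIMURA80 tag refers to the standard Kimura two-parameter (K80) model, the distance for one genomic copy aligned to one consensus sequence is:

$$
d = -\frac{1}{2}\,\ln\!\left[(1 - 2P - Q)\,\sqrt{1 - 2Q}\right]
$$

Here $P$ and $Q$ are the proportions of transitional and transversional differences in that single pairwise alignment. Both are defined only for one copy/consensus pair, so a hit that RepeatMasker has merged from several consensus sequences has no single alignment over which to compute them.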
Hi Toby,
Thanks for adding the re-calculation of Kimura distance in the final gff file. However, I've noticed that the values are missing (KIMURA80=nan) for a significant number of LINEs and LTRs. This is due to a discrepancy in repeat IDs: in RepeatMasker's .out file, the strings "_3end", "_5end", or "_orf2" are removed from the repeat names after the positions are adjusted. As a result, the corresponding consensus sequence cannot be found, and no error message is produced.
Here is an example line in the final .gff file:
Corresponding repeat IDs in the Dfam database:
With LTRs, some of the RepeatMasker outputs have "-INT" added, which causes the same problem.
Any comments would be appreciated. Thanks!
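To make the ID mismatch concrete, here is a minimal Python sketch of a diagnostic lookup. It is not part of the pipeline: the function name `candidate_consensus_ids` and the repeat names in the example are hypothetical, and the only rules it encodes are the ones described in this issue (the "_3end", "_5end" and "_orf2" strings missing from the RepeatMasker name, and "-INT" added to some LTR names). It lists every library ID a RepeatMasker name could refer to instead of picking one, since a single name may map to several consensus sequences.

```python
# Strings that appear in library IDs but are dropped in RepeatMasker's .out names
# (taken from this issue; other libraries may use different conventions).
SUFFIXES = ("_3end", "_5end", "_orf2")


def candidate_consensus_ids(rm_name, library_ids):
    """Return the library IDs that a RepeatMasker repeat name could refer to."""
    candidates = []
    # Exact match: nothing was renamed.
    if rm_name in library_ids:
        candidates.append(rm_name)
    # Case 1: the library ID carries a suffix that RepeatMasker removed.
    for suffix in SUFFIXES:
        if rm_name + suffix in library_ids:
            candidates.append(rm_name + suffix)
    # Case 2: RepeatMasker added "-INT" (any case) to an LTR name.
    if rm_name.lower().endswith("-int"):
        base = rm_name[: -len("-int")]
        if base in library_ids and base not in candidates:
            candidates.append(base)
    return candidates


# Hypothetical library IDs and RepeatMasker names, for illustration only.
library = {"L1_EX_3end", "L1_EX_5end", "L1_EX_orf2", "ERV_EX", "DNA_EX"}
for name in ("L1_EX", "ERV_EX-INT", "DNA_EX", "UNKNOWN"):
    print(name, candidate_consensus_ids(name, library))
```

An empty list marks a repeat that would end up as KIMURA80=nan, which could at least be logged as a warning. More than one candidate marks a hit where it is ambiguous which consensus sequence the distance should be calculated against.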