final kimura calculation missing for LINEs #173

Open
hy09 opened this issue Jan 8, 2025 · 2 comments

hy09 commented Jan 8, 2025

Hi Toby,
Thanks for adding the re-calculation of Kimura distance to the final GFF file. However, I've noticed that the values are missing (KIMURA80=nan) for a significant number of LINEs and LTRs. This is due to a discrepancy in repeat IDs: in RepeatMasker's .out file, the suffixes "_3end", "_5end", and "_orf2" are stripped from the repeat names after the positions are adjusted. As a result, the corresponding consensus sequence cannot be found, and no error message is produced.

Here is an example line in the final .gff file:

1	Earl_Grey	LINE/L1	3054064	3054829	1592	-	.	TSTART=5335;TEND=6124;ID=L1MB4;SHORTTE=F;KIMURA80=nan
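Since the lookup fails silently, it can help to list the affected records directly. The following is a minimal sketch, assuming a tab-separated GFF whose ninth column holds `key=value` pairs separated by `;`, as in the line above. The function names `parse_attributes` and `missing_kimura` are hypothetical and not part of Earl Grey.

```python
def parse_attributes(field):
    """Split a 'key=value;key=value' attribute column into a dict."""
    return dict(item.split("=", 1) for item in field.split(";") if "=" in item)


def missing_kimura(gff_lines):
    """Yield (seqid, start, end, repeat ID) for records with KIMURA80=nan."""
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue  # not a complete GFF record
        attrs = parse_attributes(cols[8])
        if attrs.get("KIMURA80", "").lower() == "nan":
            yield cols[0], int(cols[3]), int(cols[4]), attrs.get("ID")


# The example record from this issue, rebuilt with explicit tabs.
example = "\t".join([
    "1", "Earl_Grey", "LINE/L1", "3054064", "3054829", "1592", "-", ".",
    "TSTART=5335;TEND=6124;ID=L1MB4;SHORTTE=F;KIMURA80=nan",
])
hits = list(missing_kimura([example]))
print(hits)  # [('1', 3054064, 3054829, 'L1MB4')]
```

Counting the distinct IDs among the hits gives a quick picture of which library entries are involved.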

Corresponding repeat IDs in the Dfam database:

>L1MB4_3end#LINE/L1 @Eutheria [S:45,55]
>L1MB4_5end#LINE/L1 @Eutheria [S:55]

With LTRs, some of the RepeatMasker outputs have "-INT" added, which causes the same problem.
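To illustrate the mismatch, here is a rough sketch that groups library headers by a suffix-stripped name so that a masked ID such as L1MB4 can be traced back to its candidate consensus sequences. The suffix list contains only the strings mentioned in this issue and is an assumption, not RepeatMasker's full naming scheme. `base_name` and `index_library` are hypothetical helpers.

```python
import re
from collections import defaultdict

# Suffixes named in this issue only; treat this list as an assumption.
SUFFIX_RE = re.compile(r"(_3end|_5end|_orf2|-int)$", re.IGNORECASE)


def base_name(repeat_id):
    """Drop the '#class/family' part and one trailing suffix, if present."""
    name = repeat_id.split("#", 1)[0]
    return SUFFIX_RE.sub("", name)


def index_library(headers):
    """Map each suffix-stripped name to the library IDs that share it."""
    index = defaultdict(list)
    for header in headers:
        seq_id = header.lstrip(">").split()[0]  # e.g. 'L1MB4_3end#LINE/L1'
        index[base_name(seq_id)].append(seq_id)
    return index


headers = [
    ">L1MB4_3end#LINE/L1 @Eutheria [S:45,55]",
    ">L1MB4_5end#LINE/L1 @Eutheria [S:55]",
]
index = index_library(headers)
candidates = index.get(base_name("L1MB4"), [])
print(candidates)  # ['L1MB4_3end#LINE/L1', 'L1MB4_5end#LINE/L1']
```

Note that the lookup returns two candidates for a single masked ID, so the name alone does not say which consensus a given copy should be compared against.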

Any comments would be appreciated. Thanks!

@TobyBaril (Owner)

This is due to using pre-existing libraries in addition to the de novo pipeline. @jamesdgalbraith we should be able to work something out for this I think?

@jamesdgalbraith (Collaborator)

I think this wouldn't be possible to implement, as the RepeatMasker GFF is what we use for the coordinates and identity of the repeats, and the divergence scripts were written to be compatible with de novo curated libraries rather than Dfam input.

To achieve this we'd need to fiddle with RepeatMasker's complex process of merging and adjusting coordinates, which I for one haven't been able to understand. Additionally, we wouldn't be able to merge the repeats back together: the divergence is the genetic distance between a repeat copy in the genome and the consensus sequence it corresponds to, and in this case there are two or more separate consensus sequences, so from a scientific perspective merging or averaging the distances makes no sense.
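For context, the sketch below computes the textbook Kimura two-parameter (K80) distance, K = -1/2 ln((1 - 2P - Q) sqrt(1 - 2Q)), where P and Q are the proportions of transitions and transversions over the compared sites. It is not taken from Earl Grey's divergence scripts, which may differ in detail. It assumes the genomic copy and one consensus are already aligned to equal length, and the function name `kimura80` is hypothetical.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def kimura80(copy_aln, consensus_aln):
    """K80 distance between two pre-aligned sequences; nan if undefined."""
    if len(copy_aln) != len(consensus_aln):
        raise ValueError("aligned sequences must have equal length")
    sites = transitions = transversions = 0
    for a, b in zip(copy_aln.upper(), consensus_aln.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # skip gaps and ambiguity codes
        sites += 1
        if a == b:
            continue
        pair = {a, b}
        if pair <= PURINES or pair <= PYRIMIDINES:
            transitions += 1    # A<->G or C<->T
        else:
            transversions += 1  # purine<->pyrimidine
    if sites == 0:
        return math.nan  # nothing to compare
    p = transitions / sites
    q = transversions / sites
    term1 = 1 - 2 * p - q
    term2 = 1 - 2 * q
    if term1 <= 0 or term2 <= 0:
        return math.nan  # too divergent for the logarithm to be defined
    return -0.5 * math.log(term1 * math.sqrt(term2))


# One transition (A<->G) across ten compared sites.
print(round(kimura80("AAAAAAAAAA", "GAAAAAAAAA"), 4))  # 0.1116
```

The function takes exactly one consensus per copy, which is the point made above: a value computed against a 5' consensus and another against a 3' consensus are distances to different references.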
