URL Escape Characters Converted #250
Comments
Here is some code to convert semicolons within quotes back into URL escape characters:

tmp = 'gene_id "RrIowa_0339"; ID "gene:RrIowa_0339"; biotype "rRNA"; description "Large Subunit Ribosomal RNA; lsuRNA; 23S ribosomal RNA"; logic_name "ena_rna"; original_biotype "ncrna_gene"'

# Assumes the quote character in the 9th column is a double quote or <"> character. This is the
# correct character to use based on the specification. More information can be found here:
# https://github.com/NBISweden/GAAS/blob/master/annotation/knowledge/gxf.md#main-points-and-differences-between-gtf-formats
def url_escape_inside_quotes(line, delimiter=';', url_encoding='%3B'):
    quote_count = 0
    inside_quotes = False
    fixed = ''
    for c in line:
        if c == '"':
            # Reached the opening or closing border of
            # a quote, increase the counter and
            # check where we are in the string
            quote_count += 1
            inside_quotes = True
            if quote_count > 1:
                # Reached end border of quote,
                # reset boolean flag and counters
                inside_quotes = False
                quote_count = 0
        if inside_quotes:
            # Replace reserved delimiter with
            # another character, let's use a
            # url encoding of the character
            if c == delimiter:
                c = url_encoding
        # Add the existing/converted character
        fixed += c
    return fixed

# gene_id "RrIowa_0339"; ID "gene:RrIowa_0339"; biotype "rRNA"; description "Large Subunit Ribosomal RNA%3B lsuRNA%3B 23S ribosomal RNA"; logic_name "ena_rna"; original_biotype "ncrna_gene"
print(url_escape_inside_quotes(tmp))
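The same idea can probably be written more compactly with a regular expression, and restricted to the 9th column of a full tab-separated line. This is only a sketch: the names `escape_quoted_semicolons` and `escape_gtf_line` are made up for illustration and are not part of AGAT, and it assumes quoted values never contain an escaped double quote.

```python
import re

# Matches one double-quoted value; assumes values contain no escaped quotes.
QUOTED_VALUE = re.compile(r'"[^"]*"')


def escape_quoted_semicolons(attributes, delimiter=';', url_encoding='%3B'):
    """Replace `delimiter` with `url_encoding` inside double-quoted values only."""
    return QUOTED_VALUE.sub(
        lambda m: m.group(0).replace(delimiter, url_encoding),
        attributes,
    )


def escape_gtf_line(line):
    """Apply the escape to the 9th column of one tab-separated GTF line."""
    if line.startswith('#'):
        return line  # leave comment and directive lines untouched
    columns = line.rstrip('\n').split('\t')
    if len(columns) != 9:
        return line  # not a 9-column feature line; leave untouched
    columns[8] = escape_quoted_semicolons(columns[8])
    # Restore the trailing newline only if the input line had one.
    return '\t'.join(columns) + ('\n' if line.endswith('\n') else '')
```

Restricting the replacement to column 9 avoids touching semicolons that might appear elsewhere on the line, for example in comment lines.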
in GFF3. The piece of code dealing with that in AGAT is the same for GFF and GTF, so I will try to fix that.
Okay, that sounds good @Juke34. Thank you for taking the time to look deeper into this issue. I appreciate it!
Describe the bug
agat_convert_sp_gff2gtf.pl removes URL escape characters in the 9th column. In my testing, it removed the URL escape character that encodes a semicolon, i.e. the ';' character. After running agat_convert_sp_gff2gtf.pl, occurrences of '%3B' are converted to ';'. As I understand it, these URL encodings are used to prevent issues with parsing the GTF file later.

Is this behavior expected? Here is some documentation from your team. Please see the row about gff3 format. I already have a gff3 file (which is why the URL escape characters exist), but I feel the same rules should apply to the GTF3 format. Wouldn't you want to avoid inserting a reserved delimiter character (like ';') within the value of a tag? It just makes parsing the file more of a headache later. I am not sure whether the gtf3 specification outlines how to handle such edge cases, but retaining the URL escape character seems better.
I am interested to hear your thoughts.
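To illustrate the parsing headache, here is a small sketch using the description value from my file. `naive_parse` is a hypothetical, deliberately simple parser that splits on ';' without looking at quotes; it is not how AGAT or any particular tool parses attributes.

```python
from urllib.parse import unquote


def naive_parse(attributes):
    """Split a GTF attribute string on ';' without regard to quoting."""
    parsed = {}
    for field in attributes.split(';'):
        field = field.strip()
        if not field:
            continue
        key, _, value = field.partition(' ')
        parsed[key] = value.strip('"')
    return parsed


unescaped = 'gene_id "RrIowa_0339"; description "Large Subunit Ribosomal RNA; lsuRNA; 23S ribosomal RNA"'
escaped = 'gene_id "RrIowa_0339"; description "Large Subunit Ribosomal RNA%3B lsuRNA%3B 23S ribosomal RNA"'

# The unescaped value is cut at the first ';' and its remainder is
# misread as two extra tags ('lsuRNA' and '23S').
broken = naive_parse(unescaped)

# The escaped value survives the split and can be decoded afterwards.
intact = naive_parse(escaped)
description = unquote(intact['description'])
```

With the escape retained, the parser sees two tags and the full description can be recovered with `unquote`; without it, the same parser reports four tags and a truncated description.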
Before (Rickettsia_rickettsii_str_iowa_gca_000017445.ASM1744v3.49.gff3): contains '%3B'
After (Rickettsia_rickettsii_str_iowa_gca_000017445.ASM1744v3.49.gtf): converted '%3B'
General (please complete the following information):
To Reproduce
I would just insert that character in a gff3 file you have and then run the following:
If you would like, I can provide you with the exact gff3 I am using. Please let me know what you think.
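One way to confirm the difference between the input and output files might be to count the escapes in the 9th column of each. This is a rough sketch: `count_escaped_semicolons` is a hypothetical helper, not an AGAT function, and it assumes tab-separated feature lines with exactly nine columns.

```python
def count_escaped_semicolons(lines, escape='%3B'):
    """Count occurrences of `escape` in the 9th column of GFF/GTF feature lines."""
    total = 0
    for line in lines:
        if line.startswith('#'):
            continue  # skip comment and directive lines
        columns = line.rstrip('\n').split('\t')
        if len(columns) == 9:
            total += columns[8].count(escape)
    return total
```

It accepts any iterable of lines, so an open file handle can be passed directly, e.g. `with open(path) as fh: count_escaped_semicolons(fh)`, where `path` is a placeholder for the gff3 or gtf file. A nonzero count for the gff3 and zero for the gtf would match what I am seeing.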
Expected behavior
I am not sure if this is expected behavior or not based on the specification of gtf3. Maybe there is no guidance, and we live in the wild, wild west.