URL Escape Characters Converted #250

Open
skchronicles opened this issue May 6, 2022 · 3 comments

skchronicles commented May 6, 2022

Describe the bug
agat_convert_sp_gff2gtf.pl decodes URL escape characters in the 9th column. In my testing, it decoded %3B, the URL escape for the semicolon (;) character: after running agat_convert_sp_gff2gtf.pl, occurrences of %3B are converted to ;. As I understand it, these URL encodings are used to prevent problems when parsing the GTF file later.

Is this behavior expected? Here is some documentation from your team. Please see the row about gff3 format. I already have a gff3 file (which is why the URL escape characters exist), but I would expect the same rules to apply to the GTF3 format. Wouldn't you want to avoid inserting a reserved delimiter character (like ';') within the value of a tag? That just makes parsing the file more of a headache later. I am not sure whether the gtf3 specification outlines how to handle such edge cases, but retaining the URL escape characters seems better.

I am interested to hear your thoughts.

Before (Rickettsia_rickettsii_str_iowa_gca_000017445.ASM1744v3.49.gff3): contains %3B

Chromosome	ena	ncRNA_gene	286157	288917	.	+	.	ID=gene:RrIowa_0339;biotype=rRNA;description=Large Subunit Ribosomal RNA%3B lsuRNA%3B 23S ribosomal RNA;gene_id=RrIowa_0339;logic_name=ena_rna

After (Rickettsia_rickettsii_str_iowa_gca_000017445.ASM1744v3.49.gtf): converted %3B

Chromosome	ena	gene	286157	288917	.	+	.	gene_id "RrIowa_0339"; ID "gene:RrIowa_0339"; biotype "rRNA"; description "Large Subunit Ribosomal RNA; lsuRNA; 23S ribosomal RNA"; logic_name "ena_rna"; original_biotype "ncrna_gene";
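To illustrate the parsing concern, here is a minimal sketch comparing a naive semicolon split with a quote-aware split on the converted attribute column above. The function names `naive_fields` and `quote_aware_fields` are hypothetical, and the quote-aware version assumes values never contain an embedded double quote.

```python
import re

# Attribute column copied from the converted GTF line above.
attrs = ('gene_id "RrIowa_0339"; ID "gene:RrIowa_0339"; biotype "rRNA"; '
         'description "Large Subunit Ribosomal RNA; lsuRNA; 23S ribosomal RNA"; '
         'logic_name "ena_rna"; original_biotype "ncrna_gene";')

def naive_fields(column9):
    """Split on every semicolon, ignoring quotes (hypothetical helper)."""
    return [f.strip() for f in column9.split(';') if f.strip()]

def quote_aware_fields(column9):
    """Split only on semicolons outside double quotes (hypothetical helper).

    Assumes quoted values never contain an embedded double quote.
    """
    pattern = r'(?:[^;"]|"[^"]*")+'
    return [m.group(0).strip() for m in re.finditer(pattern, column9)
            if m.group(0).strip()]

print(len(naive_fields(attrs)))        # 8: the description is cut into 3 pieces
print(len(quote_aware_fields(attrs)))  # 6: one field per tag
```

A parser that splits on every semicolon sees eight fields instead of six, which is the failure mode the escaped %3B avoids.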

General (please complete the following information):

  • AGAT version: 0.8.0
  • Installed using singularity (from quay.io): see below
  • OS: CentOS

To Reproduce
Insert a %3B escape into the 9th column of any gff3 file you have, then run the following:

# Steps for converting messy gff into properly formatted GTF file
# 1. Pull image from registry and create SIF
# module load singularity 
SINGULARITY_CACHEDIR=$PWD singularity pull \
    docker://quay.io/biocontainers/agat:0.8.0--pl5262hdfd78af_0 

# 2. Run AGAT to do the heavy lifting of GTF conversion
singularity exec -B $PWD \
    agat_0.8.0--pl5262hdfd78af_0.sif agat_convert_sp_gff2gtf.pl \
        --gff input.gff \
        -o converted.gtf
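For anyone who wants a self-contained input, the sketch below writes a one-feature GFF3 file built from the "Before" line in this report and counts %3B escapes in a file, so the same check can be run on the converted output. The helper names `write_test_gff` and `count_escapes` are hypothetical, the `##gff-version 3` header is an assumption, and whether AGAT accepts a file with a single gene feature has not been tested here.

```python
import tempfile
from pathlib import Path

# The "Before" feature from this report, rebuilt with explicit tabs.
GFF3_LINE = "\t".join([
    "Chromosome", "ena", "ncRNA_gene", "286157", "288917", ".", "+", ".",
    "ID=gene:RrIowa_0339;biotype=rRNA;"
    "description=Large Subunit Ribosomal RNA%3B lsuRNA%3B 23S ribosomal RNA;"
    "gene_id=RrIowa_0339;logic_name=ena_rna",
])

def write_test_gff(path):
    """Write a minimal GFF3 file containing the single test feature."""
    Path(path).write_text("##gff-version 3\n" + GFF3_LINE + "\n")

def count_escapes(path, escape="%3B"):
    """Count occurrences of an escape sequence on non-comment lines."""
    lines = Path(path).read_text().splitlines()
    return sum(line.count(escape) for line in lines
               if not line.startswith("#"))

if __name__ == "__main__":
    # Written to a temporary directory here; copy it to input.gff to
    # use it with the singularity command above.
    with tempfile.TemporaryDirectory() as workdir:
        gff = Path(workdir) / "input.gff"
        write_test_gff(gff)
        print(count_escapes(gff))
```

Running `count_escapes` on converted.gtf after the conversion would show whether the escapes survived.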

If you would like, I can provide you with the exact gff3 I am using. Please let me know what you think.

Expected behavior
I am not sure if this is expected behavior or not based on the specification of gtf3. Maybe there is no guidance, and we live in the wild, wild west.

Author

skchronicles commented May 6, 2022

Here is some code to convert semicolons within quotes back into URL escape characters:

tmp = 'gene_id "RrIowa_0339"; ID "gene:RrIowa_0339"; biotype "rRNA"; description "Large Subunit Ribosomal RNA; lsuRNA; 23S ribosomal RNA"; logic_name "ena_rna"; original_biotype "ncrna_gene"'

# Assumes the quote character in the 9th column is a double quote or <"> character. This is the 
# correct character to use based on the specification. More information can be found here:
# https://github.com/NBISweden/GAAS/blob/master/annotation/knowledge/gxf.md#main-points-and-differences-between-gtf-formats
def url_escape_inside_quotes(line, delimiter=';', url_encoding='%3B'):
    inside_quotes = False
    fixed = []
    for c in line:
        if c == '"':
            # Each double quote marks the start or the end of
            # a quoted value, so toggle the flag when we cross one
            inside_quotes = not inside_quotes
        elif inside_quotes and c == delimiter:
            # Replace the reserved delimiter with
            # another character, let's use a
            # url encoding of the character
            c = url_encoding

        # Add the existing/converted character
        fixed.append(c)

    return ''.join(fixed)

# gene_id "RrIowa_0339"; ID "gene:RrIowa_0339"; biotype "rRNA"; description "Large Subunit Ribosomal RNA%3B lsuRNA%3B 23S ribosomal RNA"; logic_name "ena_rna"; original_biotype "ncrna_gene"
print(url_escape_inside_quotes(tmp)) 
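The same workaround can also be sketched with a regular expression instead of a character loop. This is an alternative under the same assumptions as the function above (double-quoted values, no embedded double quotes inside a value); the name `escape_semicolons_in_quotes` is hypothetical.

```python
import re

def escape_semicolons_in_quotes(column9, delimiter=';', url_encoding='%3B'):
    """Replace the delimiter with its URL encoding inside quoted values only."""
    def _escape(match):
        # match.group(0) is one complete quoted value, quotes included.
        return match.group(0).replace(delimiter, url_encoding)

    # "[^"]*" matches a double-quoted value with no embedded double quote.
    return re.sub(r'"[^"]*"', _escape, column9)

example = ('gene_id "RrIowa_0339"; '
           'description "Large Subunit Ribosomal RNA; lsuRNA; 23S ribosomal RNA"; '
           'logic_name "ena_rna"')
print(escape_semicolons_in_quotes(example))
```

Semicolons that separate tags stay untouched because they sit outside every quoted match.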

skchronicles added a commit to skchronicles/RNA-seek that referenced this issue May 6, 2022
Collaborator

Juke34 commented May 17, 2022

In GFF3:
URL escaping rules are used for tags or values containing the following characters: ",=;". Spaces are allowed in this field, but tabs must be replaced with the %09 URL escape.

The code that handles this in AGAT is shared between GFF and GTF, so I will try to fix that.
GTF does not have any official rule about it. Because GTF quotes textual values, it is not a problem whether they are escaped or not.
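The GFF3 escaping rule quoted above can be sketched as a pair of helpers. This is an illustration only, not AGAT's code: the names `gff3_escape` and `gff3_unescape` are hypothetical, and the escaped set goes slightly beyond the quoted ",=;" by also covering %, &, tab, newline and carriage return, which is an assumption based on a reading of the GFF3 specification.

```python
from urllib.parse import unquote

# Characters escaped by this sketch (assumed set, see the note above).
_GFF3_ESCAPES = {
    '%': '%25', ';': '%3B', '=': '%3D', '&': '%26', ',': '%2C',
    '\t': '%09', '\n': '%0A', '\r': '%0D',
}

def gff3_escape(value):
    """Percent-encode reserved characters in a GFF3 column-9 value.

    Works character by character, so an existing '%' is encoded once
    and never double-escaped within a single call.
    """
    return ''.join(_GFF3_ESCAPES.get(ch, ch) for ch in value)

def gff3_unescape(value):
    """Decode percent-escapes back to literal characters."""
    return unquote(value)

description = 'Large Subunit Ribosomal RNA; lsuRNA; 23S ribosomal RNA'
print(gff3_escape(description))
print(gff3_unescape(gff3_escape(description)) == description)
```

Spaces are left as they are, matching the rule that spaces are allowed in this field.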

Author

skchronicles commented May 17, 2022

Okay, that sounds good @Juke34.

Thank you for taking the time to look deeper into this issue. I appreciate it!

@Juke34 Juke34 closed this as completed Nov 25, 2022
@Juke34 Juke34 reopened this Nov 25, 2022