You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
In the get_frequencies() function the deletions and insertions are likely improperly handled.
Please read the mpileup specification. In one hand the -NUM+SEQ and the +NUM+SEQ notation only used at the preceeding positions where in the given read an ins or del is starting:
If there is an insertion after this read base...
If there is a deletion after this read base...
For dels also in the consecutive positons there is a "*" for each read that has gap/deletion there (that is not consuming the reference). So there is a slight confusion on adding "*" to deletion (which shouldn't be) as +/- notation is preceeding the deletion/insertion, however this is minor. The bigger issue is that you have "-" and "+" in your fastadict, so the elif for the -/+ notations (elif char in {"-", "+"}:) is not executed at all. Accordingly, the nucleotides in the -/+ ins/del notations (like -5NNNNN or worse +7ACTGACT) are calculated into the standard A, C, T, G base counts.
See the code:
def get_frequencies(
sequence: str
) -> Dict[str, int]:
fastadict = {"A": 0, "T": 0, "G": 0, "C": 0, "-": 0, "+": 0, "*": 0}
sequence = sequence.upper()
index = 0
while index < len(sequence):
char = sequence[index]
if char in fastadict:
fastadict[char] += 1
index += 1
elif char == "^":
index += 2
elif char in {"-", "+"}:
index += 1
digit, index = find_digit(sequence, index)
index += digit
else:
index += 1
fastadict["-"] += fastadict["*"]
del fastadict["*"]
return fastadict
In case of BAM deletion starts will be represented as -NUMBER following NUMBER amount of 'N's, in case of CRAMs even for deletions you will have the reference nucleotides (and not N's) as you have the reference provided by '-f'.
Actually the ins/del data is dropped in extract_haplogroups(), so it is not used for Hg classification:
However in case of find_private_mutations() it is called directly, and the "-" and "+" keys are not dropped, although deletions may have invalid counts (due to adding potentially overlapping deletions from other reads, see "*" notation).
For private mutations it could be useful if not only the order of the -/+ notation test would be changed but you could also provide useful informatin here.
In case of deletions, if there is lets say an 5bp deletion in most cases it is still a single mutation so it would be bad practice to emit 5 private markers like delT, delA, delC, delA, delA for 5 consecutive positions. Here you should rather emit a single delTACAA instead if all dels are that in the pileup.. For insertions it is the same (luckily they are not plagued by the "*" extra notation like deletions).
Since ins could be variable base composition, and alsop ins/dels are more variable likely it would be easier to create a dict for SNPs and a dict for insdel. That way you can skip insdel info at hg classification (ou are not expecting any at the predefined positions), and you can still use this information for private marker identification.
So something similar could parse the data likely more correctly:
def get_frequencies(
sequence: str
) -> Tuple[Dict[str, int], Dict[str, int]]:
fastadict = {"A": 0, "T": 0, "G": 0, "C": 0}
insdeldict = defaultdict(int)
sequence = sequence.upper()
index = 0
while index < len(sequence):
char = sequence[index]
if char in fastadict:
fastadict[char] += 1
index += 1
elif char == "-":
index += 1
digit, index = find_digit(sequence, index)
# when no -f is used in pileup the deletion is denoted by N's and not the sequence from the reference
# however we can fix it at provate markers, and also same pos same length del means the same deletion
insdelsdict[f"del{digit}] += 1
index += digit
elif char == "+":
index += 1
digit, index = find_digit(sequence, index)
# insertion sequences are in the read, so they are also included in the pileup with or without -f option
ins = f"ins{sequence[index:index+digit]}"
insdelsdict[ins] += 1
index += digit
elif char == "^":
index += 2
else:
index += 1
return fastadict,insdeldict
This means that you have to ix up code where you use this information (no longer needed to drop -/+) and also would required to fix up the private marker part to fix up dels to delSEQFROMREFERENCE notation (as there the fasta reference is available but from bam the DELs are just seen is Ns). That way ins and del could be denoted same style (insACTG delTTCA...) or whatever you prefer.
Z
The text was updated successfully, but these errors were encountered:
In the get_frequencies() function the deletions and insertions are likely improperly handled.
Please read the mpileup specification. In one hand the -NUM+SEQ and the +NUM+SEQ notation only used at the preceeding positions where in the given read an ins or del is starting:
For dels also in the consecutive positons there is a "*" for each read that has gap/deletion there (that is not consuming the reference). So there is a slight confusion on adding "*" to deletion (which shouldn't be) as +/- notation is preceeding the deletion/insertion, however this is minor. The bigger issue is that you have "-" and "+" in your fastadict, so the elif for the -/+ notations (
elif char in {"-", "+"}:
) is not executed at all. Accordingly, the nucleotides in the -/+ ins/del notations (like -5NNNNN or worse +7ACTGACT) are calculated into the standard A, C, T, G base counts.See the code:
In case of BAM deletion starts will be represented as -NUMBER following NUMBER amount of 'N's, in case of CRAMs even for deletions you will have the reference nucleotides (and not N's) as you have the reference provided by '-f'.
Actually the ins/del data is dropped in extract_haplogroups(), so it is not used for Hg classification:
However in case of find_private_mutations() it is called directly, and the "-" and "+" keys are not dropped, although deletions may have invalid counts (due to adding potentially overlapping deletions from other reads, see "*" notation).
For private mutations it could be useful if not only the order of the -/+ notation test would be changed but you could also provide useful informatin here.
In case of deletions, if there is lets say an 5bp deletion in most cases it is still a single mutation so it would be bad practice to emit 5 private markers like delT, delA, delC, delA, delA for 5 consecutive positions. Here you should rather emit a single delTACAA instead if all dels are that in the pileup.. For insertions it is the same (luckily they are not plagued by the "*" extra notation like deletions).
Since ins could be variable base composition, and alsop ins/dels are more variable likely it would be easier to create a dict for SNPs and a dict for insdel. That way you can skip insdel info at hg classification (ou are not expecting any at the predefined positions), and you can still use this information for private marker identification.
So something similar could parse the data likely more correctly:
This means that you have to ix up code where you use this information (no longer needed to drop -/+) and also would required to fix up the private marker part to fix up dels to delSEQFROMREFERENCE notation (as there the fasta reference is available but from bam the DELs are just seen is Ns). That way ins and del could be denoted same style (insACTG delTTCA...) or whatever you prefer.
Z
The text was updated successfully, but these errors were encountered: