dsIUPAC

BjornFJohansson edited this page Feb 18, 2025 · 7 revisions

This document defines a potential extension to the IUPAC DNA alphabet tailored for double-stranded DNA. The extension, called dsIUPAC for now, allows unambiguous description of a double-stranded DNA molecule with single-stranded regions using a single sequence of characters.

IUPAC

The IUPAC DNA alphabet is a set of symbols designated by the International Union of Pure and Applied Chemistry (IUPAC) to represent nucleotide bases in DNA sequences, including ambiguity codes for cases where multiple nucleotides are possible at a particular position. Here are the symbols and their meanings:

  1. A - Adenine
  2. T - Thymine
  3. C - Cytosine
  4. G - Guanine

Ambiguity codes (representing multiple possible nucleotides):

  1. R - Purine (A or G)
  2. Y - Pyrimidine (C or T)
  3. S - Strong interaction (G or C)
  4. W - Weak interaction (A or T)
  5. K - Keto group (T or G)
  6. M - Amino group (A or C)
  7. B - Not A (C, G, or T)
  8. D - Not C (A, G, or T)
  9. H - Not G (A, C, or T)
  10. V - Not T (A, C, or G)
  11. N - Any nucleotide (A, T, C, or G)

These symbols allow for flexibility in representing DNA sequences, especially when there is uncertainty in base composition at specific positions. They do not, however, indicate whether the DNA is single- or double-stranded.
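The ambiguity codes listed above can be treated as a lookup from each symbol to the set of bases it stands for. The following is a minimal sketch under that reading; the names `IUPAC_BASES`, `matches` and `expansions` are hypothetical and chosen only for illustration.

```python
from itertools import product

# Each IUPAC symbol mapped to the bases it may represent (from the list above).
IUPAC_BASES = {
    "A": "A", "T": "T", "C": "C", "G": "G",
    "R": "AG", "Y": "CT", "S": "GC", "W": "AT", "K": "TG", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ATCG",
}

def matches(code, base):
    """Return True if the concrete base is allowed by the IUPAC code."""
    return base.upper() in IUPAC_BASES[code.upper()]

def expansions(pattern):
    """List every concrete sequence that an ambiguous pattern can stand for."""
    choices = (IUPAC_BASES[symbol] for symbol in pattern.upper())
    return ["".join(bases) for bases in product(*choices)]

print(matches("R", "G"))   # R is A or G
print(expansions("RY"))    # all purine-pyrimidine dinucleotides
```

Note that the number of expansions grows multiplicatively with each ambiguous position, so `expansions` is only practical for short patterns.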

dsIUPAC

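| Alphabet | Symbol | Complement | Bases | dsIUPAC extended meaning |
|----------|--------|------------|-------|--------------------------|
| IUPAC | G | C | G | G/C |
| IUPAC | A | T | A | A/T |
| IUPAC | T | A | T | T/A |
| IUPAC | C | G | C | C/G |
| IUPAC | R | Y | G or A | R/Y |
| IUPAC | Y | R | T or C | Y/R |
| IUPAC | M | K | A or C | M/K |
| IUPAC | K | M | G or T | K/M |
| IUPAC | S | S | G or C | S/S |
| IUPAC | W | W | A or T | W/W |
| IUPAC | H | D | A or C or T | H/D |
| IUPAC | B | V | G or T or C | B/V |
| IUPAC | V | B | G or C or A | V/B |
| IUPAC | D | H | G or A or T | D/H |
| IUPAC | N | N | G or A or T or C | N/N |
| dsIUPAC | U | O | U in top strand, A in complementary strand | U/A |
| dsIUPAC | O | U | A in top strand, U in complementary strand | A/U |
| dsIUPAC | E | F | A in top strand, complementary strand empty | A/◻ |
| dsIUPAC | I | J | C in top strand, complementary strand empty | C/◻ |
| dsIUPAC | P | Q | G in top strand, complementary strand empty | G/◻ |
| dsIUPAC | X | Z | T in top strand, complementary strand empty | T/◻ |
| dsIUPAC | Z | X | A in complementary strand, top strand empty | ◻/A |
| dsIUPAC | Q | P | C in complementary strand, top strand empty | ◻/C |
| dsIUPAC | J | I | G in complementary strand, top strand empty | ◻/G |
| dsIUPAC | F | E | T in complementary strand, top strand empty | ◻/T |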

The symbols PEXI and QFZJ, which are not used by the extended IUPAC alphabet, were adopted to denote single-stranded DNA.

The choice of symbols for the dsIUPAC extension facilitates intuitive recognition of compatible single-stranded regions, i.e. sticky ends.
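The table can be read as a lookup from each symbol to a (top strand, complementary strand) pair. The sketch below decodes a dsIUPAC string into two aligned strings on that basis. The function name `expand`, the use of a space for an empty strand position, and the rule that a lowercase symbol yields lowercase bases are assumptions made for illustration; only symbols defined in the table are accepted.

```python
# Complements of the plain IUPAC symbols, taken from the table.
IUPAC_COMPLEMENT = dict(zip("GATCRYMKSWHBVDN", "CTAGYRKMSWDVBHN"))

# dsIUPAC-only symbols as (top, bottom); " " marks an empty strand position.
DS_ONLY = {
    "U": ("U", "A"), "O": ("A", "U"),
    "E": ("A", " "), "I": ("C", " "), "P": ("G", " "), "X": ("T", " "),
    "Z": (" ", "A"), "Q": (" ", "C"), "J": (" ", "G"), "F": (" ", "T"),
}

def expand(seq):
    """Decode a dsIUPAC string into (top, bottom) strings of equal length."""
    top, bottom = [], []
    for symbol in seq:
        key = symbol.upper()
        if key in DS_ONLY:
            t, b = DS_ONLY[key]
        elif key in IUPAC_COMPLEMENT:
            t, b = key, IUPAC_COMPLEMENT[key]
        else:
            raise ValueError(f"symbol not defined in the table: {symbol!r}")
        if symbol.islower():  # assumption: keep the case of the input symbol
            t, b = t.lower(), b.lower()
        top.append(t)
        bottom.append(b)
    return "".join(top), "".join(bottom)

top, bottom = expand("PEXIaUaOaQFZJ")
print(top)
print(bottom)
```

The bottom string is written left to right in the same orientation as the top string, so it reads 3' to 5'.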

Example

Two double stranded DNA molecules with compatible terminal 5'- single strand overhangs:

GATCaUaAa                   ad-hoc representation
    tAtUtCTAG         
    

PEXIaUaOaQFZJ               representation using dsIUPAC

Alphabetically, P is followed by Q, E by F, and I by J. This pattern is broken only by the X, Z pair, of necessity, since Y is already used in the IUPAC alphabet.
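Because each single-stranded symbol has exactly one partner (P with Q, E with F, X with Z, I with J), checking whether two ends can anneal reduces to a position-by-position comparison. The following is a minimal sketch of that idea; `can_anneal` and the helper names are hypothetical, and the check only considers the right end of the first molecule against the left end of the second, without flipping either molecule.

```python
from itertools import takewhile

# Partner symbols for single-stranded positions, from the Complement column.
PARTNER = {"P": "Q", "E": "F", "X": "Z", "I": "J",
           "Q": "P", "F": "E", "Z": "X", "J": "I"}

def leading_overhang(seq):
    """Return the run of single-stranded symbols at the left end."""
    return "".join(takewhile(lambda s: s.upper() in PARTNER, seq))

def trailing_overhang(seq):
    """Return the run of single-stranded symbols at the right end."""
    return leading_overhang(seq[::-1])[::-1]

def can_anneal(left, right):
    """True if the right end of `left` pairs with the left end of `right`."""
    a = trailing_overhang(left).upper()
    b = leading_overhang(right).upper()
    return bool(a) and len(a) == len(b) and all(
        PARTNER[x] == y for x, y in zip(a, b)
    )

print(can_anneal("PEXIaUaOaQFZJ", "PEXIaUaOaQFZJ"))  # QFZJ against PEXI
print(can_anneal("PEXIaUaOaQFZJ", "QFZJaaaPEXI"))    # QFZJ against QFZJ
```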

DNA molecules with compatible terminal 3'- single strand overhangs:

QFZJaaaPEXI    QFZJaaaPEXI           representation using dsIUPAC

    aaaGATC        aaaGATC           ad-hoc representation
CTAGttt        CTAGttt

Alphabets

ASCII CAPS  = ABCDEFGHIJKLMNOPQRSTUVWXYZ
IUPAC       = ABCD  GH  K MN   RST VW Y                
dsIUPAC     =     EF  IJ L  OPQ   U  X Z   +  IUPAC

punctuation = @ # $ % + > < * (still free)
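As a quick consistency check, the letter sets in the listing above can be compared against the ASCII capitals. This sketch simply copies the two rows as written (including L, which appears in the dsIUPAC row); the variable names are hypothetical.

```python
import string

IUPAC = set("ABCDGHKMNRSTVWY")   # letters in the IUPAC row
DS_EXTRA = set("EFIJLOPQUXZ")    # letters added in the dsIUPAC row

# The two rows share no letter.
assert not (IUPAC & DS_EXTRA)

# Capitals used by neither row.
unused = sorted(set(string.ascii_uppercase) - IUPAC - DS_EXTRA)
print(unused)  # → []
```

Under these listings every capital letter is taken, which is consistent with only punctuation characters being listed as still free.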

Representations of double stranded DNA



PEXIGULAOCQFZJ

>format1 two strings & space 
GATCGUAAAC
    CAUTUGCTAG   

>format2 two strings & hyphen
GATCGUAAAC----
----CAUTUGCTAG

>format3 two strings & pipe
GATCGUAAAC||||
||||CAUTUGCTAG

>format4 three strings, pipe & hyphen
GATCGUAAAC----
||||||||||||||
----CAUTUGCTAG
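The four formats differ only in how empty strand positions are filled and in whether a separator line is present. The sketch below derives formats 2 to 4 from the two space-padded strings of format 1; the function name `to_formats` and the returned dictionary layout are assumptions made for illustration.

```python
def to_formats(top, bottom):
    """Render two space-padded strand strings in the four formats above."""
    width = max(len(top), len(bottom))
    # Pad both strands to the same width so empty positions are explicit.
    top, bottom = top.ljust(width), bottom.ljust(width)
    return {
        "format1": [top.rstrip(), bottom.rstrip()],
        "format2": [top.replace(" ", "-"), bottom.replace(" ", "-")],
        "format3": [top.replace(" ", "|"), bottom.replace(" ", "|")],
        "format4": [top.replace(" ", "-"), "|" * width,
                    bottom.replace(" ", "-")],
    }

formats = to_formats("GATCGUAAAC", "    CAUTUGCTAG")
for name, lines in formats.items():
    print(f">{name}")
    print("\n".join(lines))
```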