
Entropy


Shannon Entropy

Shannon entropy is calculated using the formula

$H_N(X) = \dfrac{H(X)}{\log_2 n} = - \dfrac{\sum\limits_{i=1}^n P(x_i) \log_2 P(x_i)}{\log_2 n}$

which is simply the entropy normalized to the range $[0, 1]$, with $n$ being the size of the symbol alphabet.
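As a rough sketch of how this could be computed (the function and names below are illustrative, not this project's actual code):

```python
from collections import Counter
from math import log2

def normalized_entropy(symbols, alphabet_size):
    # Empirical probability of each observed symbol.
    counts = Counter(symbols)
    total = len(symbols)
    # Shannon entropy H(X) = -sum P(x) * log2 P(x).
    h = -sum((c / total) * log2(c / total) for c in counts.values())
    # Normalize by log2(n), n being the alphabet size.
    return h / log2(alphabet_size)
```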

The entropy is calculated over chunks of data composed of symbols, and the length of the symbols to consider can be specified.

For example, for the chunk 000001010011100101110111 the entropy can be calculated with

  • slen = 2 resulting in 00 00 01 01 00 11 10 01 01 11 01 11
  • slen = 3 resulting in 000 001 010 011 100 101 110 111

Obviously this yields a different entropy value depending on the symbol length.
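A possible way to perform the split, again only as an illustrative sketch:

```python
def split_symbols(chunk, slen):
    # Cut the bit string into consecutive symbols of slen bits each.
    return [chunk[i:i + slen] for i in range(0, len(chunk), slen)]

chunk = "000001010011100101110111"
print(split_symbols(chunk, 2))  # ['00', '00', '01', '01', '00', '11', '10', '01', '01', '11', '01', '11']
print(split_symbols(chunk, 3))  # ['000', '001', '010', '011', '100', '101', '110', '111']
```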

Another thing worth pointing out is how $P(x_i)$ is calculated.

For example, given two chunks of data

  • s1 = 000 001 010
  • s2 = 000 001 010 011 100 101 110 111

for s1 it could make sense to take $P(x_i = 000) = P(x_i = 001) = P(x_i = 010) = 1/3$ if s1 is considered the full alphabet of symbols, or $P(x_i = 000) = P(x_i = 001) = P(x_i = 010) = 1/8$ if the alphabet is considered to be the set of all bit strings of length 3. In practice the choice affects the alphabet size $n$ used in the normalization: $n = 3$ in the first case and $n = 2^3 = 8$ in the second.

In the end I opted for the second approach. The main motivation was that, since I'm calculating the entropy of data with no particular assumptions, I wanted to be as general as possible and didn't want to limit what the data might look like.

With this decision, the program calculates

  • $H_N(s1) = 0.5283$
  • $H_N(s2) = 1$

With the other approach (limiting the alphabet to the values actually present in the chunk), the program would instead calculate

  • $H_N(s1) = 1$
  • $H_N(s2) = 1$
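Both sets of values can be reproduced by estimating $P(x_i)$ from the observed counts and only changing the alphabet size $n$ used in the normalization; a self-contained sketch (illustrative Python, not the project's actual code):

```python
from collections import Counter
from math import log2

def normalized_entropy(symbols, alphabet_size):
    counts = Counter(symbols)
    total = len(symbols)
    h = -sum((c / total) * log2(c / total) for c in counts.values())
    return h / log2(alphabet_size)

s1 = ["000", "001", "010"]
s2 = ["000", "001", "010", "011", "100", "101", "110", "111"]

# Second approach: the alphabet is every 3-bit string, so n = 2**3 = 8.
print(round(normalized_entropy(s1, 2 ** 3), 4))  # 0.5283
print(round(normalized_entropy(s2, 2 ** 3), 4))  # 1.0

# First approach: the alphabet is only the symbols that actually occur.
print(round(normalized_entropy(s1, len(set(s1))), 4))  # 1.0
print(round(normalized_entropy(s2, len(set(s2))), 4))  # 1.0
```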

I had to revert to sticking to the exact Shannon formula; otherwise the results didn't make any sense and were too dependent on the choice of chunk and slen.

In particular, to have meaningful results it was required that the number of symbols per chunk be comparable to the alphabet size: $\dfrac{\text{chunk size}}{slen} \simeq 2^{slen}$
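For instance, with slen = 3 this condition suggests chunks of roughly $2^3 = 8$ symbols, i.e. about 24 bits, which is exactly the size of the example chunk used above:

$\dfrac{24}{3} = 8 = 2^{3}$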
