# Entropy
Shannon entropy is calculated using the formula

$$H(X) = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i)$$

and the program reports the normalized entropy

$$H_N(X) = \frac{H(X)}{\log_2 n}$$

which is simply the entropy normalized in the range $[0, 1]$.
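A minimal sketch of this computation in Python (the function name and signature are illustrative, not the program's actual API):

```python
import math
from collections import Counter

def normalized_entropy(symbols, alphabet_size):
    """Shannon entropy of a symbol sequence, normalized by log2 of the alphabet size."""
    total = len(symbols)
    probs = [count / total for count in Counter(symbols).values()]
    h = -sum(p * math.log2(p) for p in probs)  # Shannon entropy in bits
    return h / math.log2(alphabet_size)        # normalized to [0, 1]
```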
The entropy is calculated over chunks of data composed of symbols, with the possibility to specify the length of the symbols to consider.
For example, for the chunk 000001010011100101110111, the entropy can be calculated with

- slen = 2, resulting in 00 00 01 01 00 11 10 01 01 11 01 11
- slen = 3, resulting in 000 001 010 011 100 101 110 111
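The splitting itself can be sketched as follows (split_symbols is a hypothetical helper, not necessarily how the program does it):

```python
def split_symbols(chunk, slen):
    """Split a bit string into consecutive symbols of slen bits each."""
    return [chunk[i:i + slen] for i in range(0, len(chunk), slen)]

chunk = "000001010011100101110111"
print(split_symbols(chunk, 2))  # ['00', '00', '01', '01', '00', '11', '10', '01', '01', '11', '01', '11']
print(split_symbols(chunk, 3))  # ['000', '001', '010', '011', '100', '101', '110', '111']
```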
Obviously, this yields a different value of the entropy depending on the symbol length.
Another thing worth pointing out is how the alphabet size $n$ used in the normalization is chosen. For example, given two chunks of data

s1 = 000 001 010
s2 = 000 001 010 011 100 101 110 111

for s1 it could make sense to calculate the entropy either over the full alphabet of $2^{slen}$ possible symbols, or only over the symbols that actually appear in the chunk.
In the end I opted for the full alphabet. The main motivation was that, since I'm calculating the entropy of data with no particular assumptions, I wanted to be as general as possible and didn't want to limit what the data might look like.
With this decision, the program calculates

$H_N(s1) = 0.5283$, $H_N(s2) = 1$

while with the other approach (limiting the alphabet to the actual values in the chunk) the program would calculate

$H_N(s1) = 1$, $H_N(s2) = 1$.
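These numbers can be reproduced with a short sketch (plain Shannon entropy, with both normalizations shown side by side; everything here is illustrative, not the program's actual code):

```python
import math
from collections import Counter

def entropy(symbols):
    """Plain Shannon entropy of a sequence of symbols."""
    total = len(symbols)
    probs = [count / total for count in Counter(symbols).values()]
    return -sum(p * math.log2(p) for p in probs)

slen = 3
s1 = ["000", "001", "010"]
s2 = ["000", "001", "010", "011", "100", "101", "110", "111"]

for name, s in (("s1", s1), ("s2", s2)):
    full = entropy(s) / math.log2(2 ** slen)       # n = 2^slen = 8 (full alphabet)
    limited = entropy(s) / math.log2(len(set(s)))  # n = distinct symbols in the chunk
    print(f"H_N({name}) = {full:.4f} (full alphabet), {limited:.4f} (limited)")
```

One practical point against the limited normalization: it is undefined for a chunk made of a single repeated symbol, since $\log_2 1 = 0$.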
I had to revert to sticking to the exact Shannon formula, otherwise the results didn't make any sense and were too dependent on the choice of chunk and slen. In particular, to have meaningful results, it was required that