More appropriate usage of NThreshold #21
Current coverage is 88.97% (diff: 100%)

@@            master    #21   diff @@
==========================================
  Files         25     25
  Lines       1690   1696     +6
  Methods        0      0
  Messages       0      0
  Branches     311    312     +1
==========================================
+ Hits        1487   1509    +22
+ Misses       133    117    -16
  Partials      70     70
Thank you for your thoughts. I'm not entirely sure about this change, however. I think concrete examples of discord discovery in various time series would help. It would be really helpful if you had time to run a few experiments. I'll also have time to do that later on, and will then decide on the change. Thanks!
I used the datasets listed at http://www.cs.ucr.edu/~eamonn/discords/.

For the ECG dataset xmitdb_x108_0, which contains two sequences, I tried NThreshold values from 0.01 to 0.2 in steps of 0.01 (0.01, 0.02, 0.03, ..., 0.2) on both of them. The other parameters were fixed: discordsNumToReport = 1, windowSize = 200, paaSize = 5, alphabetSize = 3, strategy = NumerosityReductionStrategy.NONE.

Here are the results obtained from your implementation:
Second sequence:

Supplement to my opinion: in "Finding the most unusual time series subsequence: algorithms and applications", the authors state that the SAX alphabet size a and the SAX word size w affect only the efficiency of the algorithm, not the final result, which depends only on the user-supplied length of the discord (first paragraph of 4.2). Although NThreshold is not mentioned in that paper, in essence it is also a SAX-related parameter, like the previous two. If NThreshold influences the final result, it adds complexity for users when they set the parameters of HOT-SAX.

Here are the results obtained from the modified implementation:
Second sequence:

For ann_gun_CentroidA, there are also two sequences in the dataset. I ran your implementation with the same parameters as in the previous experiment (discordsNumToReport = 1, windowSize = 200, paaSize = 5, alphabetSize = 3, strategy = NumerosityReductionStrategy.NONE, NThreshold = 0.01 to 0.2). This time all the discord locations I got were identical. The reason is that the only discord is quite dominant within the whole sequence, so the value of NThreshold does not influence the final discord result.

I'm not sure whether these are the experiments you wanted. Let me know if any aspects still need to be confirmed; I'm willing to help.
Thank you for running the experiments; it makes perfect sense. I now recall that the first implementation wasn't based on z-normalized subsequences, but then we changed that for some reason. I'll talk with Jessica this Wednesday to see what she thinks about that and will get back to you.
NThreshold (the normalization threshold value) can be used in SAX to handle one special case, the normalization of a subsequence that is almost constant, as mentioned in "Experiencing SAX: a novel symbolic representation of time series". It maps the entire word to the middle-range symbol of the alphabet when the standard deviation of the current subsequence is lower than NThreshold.
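To make the special case concrete, here is a minimal sketch of thresholded z-normalization followed by PAA and discretization. The class and method names (`SaxThresholdSketch`, `toSaxWord`, and so on) are hypothetical and are not the library's API; the sketch assumes an alphabet of size 3 with breakpoints of roughly -0.43 and 0.43, the sample standard deviation, and a subsequence length that is a multiple of the PAA size.

```java
import java.util.Arrays;

public class SaxThresholdSketch {

    // Breakpoints for alphabet size 3 (approximate terciles of N(0, 1)).
    static final double[] CUTS_3 = {-0.43, 0.43};

    // Z-normalize; return all zeros when the standard deviation is below nThreshold.
    // Assumes series.length >= 2 and a non-zero deviation whenever nThreshold is 0.
    static double[] znorm(double[] series, double nThreshold) {
        double mean = Arrays.stream(series).average().orElse(0.0);
        double sumSq = 0.0;
        for (double v : series) {
            sumSq += (v - mean) * (v - mean);
        }
        double sd = Math.sqrt(sumSq / (series.length - 1));
        double[] out = new double[series.length]; // initialized to zeros by Java
        if (sd < nThreshold) {
            return out; // almost constant: treat as a flat line
        }
        for (int i = 0; i < series.length; i++) {
            out[i] = (series[i] - mean) / sd;
        }
        return out;
    }

    // Piecewise aggregate approximation; assumes length is a multiple of paaSize.
    static double[] paa(double[] series, int paaSize) {
        int seg = series.length / paaSize;
        double[] out = new double[paaSize];
        for (int i = 0; i < paaSize; i++) {
            double sum = 0.0;
            for (int j = 0; j < seg; j++) {
                sum += series[i * seg + j];
            }
            out[i] = sum / seg;
        }
        return out;
    }

    // Map each PAA value to 'a', 'b' or 'c' by counting the breakpoints it exceeds.
    static String toSaxWord(double[] series, int paaSize, double nThreshold) {
        StringBuilder word = new StringBuilder();
        for (double v : paa(znorm(series, nThreshold), paaSize)) {
            int idx = 0;
            for (double cut : CUTS_3) {
                if (v > cut) {
                    idx++;
                }
            }
            word.append((char) ('a' + idx));
        }
        return word.toString();
    }

    public static void main(String[] args) {
        double[] nearlyFlat = {5.0, 5.001, 4.999, 5.0, 5.001, 4.999};
        // Standard deviation is below 0.01, so every symbol is the middle one.
        System.out.println(toSaxWord(nearlyFlat, 3, 0.01)); // prints "bbb"
    }
}
```

Without the threshold, the tiny fluctuations of the same subsequence would be stretched to unit variance and spread across the whole alphabet, which is the case the threshold is meant to suppress for SAX.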
However, I think it is not appropriate to use it in the z-normalization step when we compute actual distances between pairs of subsequences. In your implementation, the "znorm" method of the class "TSProcessor" returns an array of zeros when the input subsequence has a small standard deviation (smaller than the NThreshold value, e.g. 0.01). As a result, a subsequence with little fluctuation is treated as a horizontal straight line, and the original information in that subsequence is lost. We then get incorrect distances between subsequences.
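The effect can be illustrated with a small, self-contained sketch. The names below (`ZnormDistortionSketch`, `znorm`, `euclidean`) are hypothetical and only mimic the behaviour described above; they are not the library's code. It assumes the sample standard deviation and returns zeros for an exactly constant input to avoid dividing by zero.

```java
public class ZnormDistortionSketch {

    // Z-normalize; return all zeros when sd is zero or below nThreshold.
    // Assumes series.length >= 2.
    static double[] znorm(double[] series, double nThreshold) {
        double mean = 0.0;
        for (double v : series) {
            mean += v;
        }
        mean /= series.length;
        double sumSq = 0.0;
        for (double v : series) {
            sumSq += (v - mean) * (v - mean);
        }
        double sd = Math.sqrt(sumSq / (series.length - 1));
        double[] out = new double[series.length];
        if (sd == 0.0 || sd < nThreshold) {
            return out; // shape information is discarded here
        }
        for (int i = 0; i < series.length; i++) {
            out[i] = (series[i] - mean) / sd;
        }
        return out;
    }

    // Plain Euclidean distance between two equal-length arrays.
    static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        // Two low-amplitude subsequences with opposite shapes.
        double[] rising = {0.000, 0.001, 0.002, 0.003};
        double[] falling = {0.003, 0.002, 0.001, 0.000};

        // With the threshold, both collapse to zeros, so the distance is 0.0.
        System.out.println(euclidean(znorm(rising, 0.01), znorm(falling, 0.01)));

        // Without it, the opposite shapes are far apart: about 3.46 (2 * sqrt(3)).
        System.out.println(euclidean(znorm(rising, 0.0), znorm(falling, 0.0)));
    }
}
```

Two subsequences that are mirror images of each other end up at distance zero once the threshold applies, so neither can be reported as a discord relative to the other.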
And what if we need to set a higher NThreshold value in some scenarios? Many subsequences will lose their shape information, and we will get incorrect distance values and poor discord results.
In conclusion, I think using the normalization threshold in the z-normalization step before creating the SAX representation is a good way to handle extreme cases. If we choose unsuitable SAX-related parameters, that only affects the efficiency of HOT-SAX (SAX only influences the heuristic ordering of the outer and inner loops), not the final discord results. But using this value in the z-normalization step before calculating actual distances is harmful, because it distorts the original information in the subsequences and leads to poor results. In that case, changing the value of NThreshold can change the discord results.
Therefore I think it is better to have two versions of the "znorm" method for these two different situations: (1) z-normalization for SAX, and (2) z-normalization for the actual distance calculation.
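One possible shape for this split is sketched below. The names `TwoZnormSketch`, `znormForSax` and `znormForDistance` are hypothetical and are not the names used in this pull request. The sketch assumes that the distance variant should still return zeros for an exactly constant subsequence, purely to avoid a division by zero.

```java
public class TwoZnormSketch {

    // Shared core: z-normalize unless sd is zero or below minSd.
    // Assumes series.length >= 2 and uses the sample standard deviation.
    static double[] scale(double[] series, double minSd) {
        double mean = 0.0;
        for (double v : series) {
            mean += v;
        }
        mean /= series.length;
        double sumSq = 0.0;
        for (double v : series) {
            sumSq += (v - mean) * (v - mean);
        }
        double sd = Math.sqrt(sumSq / (series.length - 1));
        double[] out = new double[series.length];
        if (sd == 0.0 || sd < minSd) {
            return out;
        }
        for (int i = 0; i < series.length; i++) {
            out[i] = (series[i] - mean) / sd;
        }
        return out;
    }

    // (1) For building SAX words: flatten almost-constant subsequences.
    static double[] znormForSax(double[] series, double nThreshold) {
        return scale(series, nThreshold);
    }

    // (2) For actual distance calculations: never apply the threshold.
    static double[] znormForDistance(double[] series) {
        return scale(series, 0.0);
    }
}
```

With this arrangement NThreshold can only change which SAX words are produced, and therefore the search order, while the distances that decide the discord stay independent of it.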
I hope my explanation makes sense to you : )