Data Profile for a time series using Singular Spectrum Analysis

Rajiv Sambasivan edited this page Apr 20, 2024 · 1 revision

If you follow the classifier design process described by Fukunaga, this material is subsumed in that process and you can stop reading this recipe. If, on the other hand, you pick your non-parametric method of choice (neural networks, trees, kernel methods, splines, etc.), use it for prediction, and stop there, then this post is relevant to your approach. In my view, developing a data profile for your task, similar to a feature profile, is critical. Some non-parametric methods like xgboost let you get away with knowing pretty much nothing about your data (this is not a criticism of xgboost; far from it, it is really powerful). Still, if you are modeling in a business setting, developing an intuition about your data is an advantage: it helps you build better models because you know the characteristics of the data that your model must account for. That is what a data profile does. It captures the critical characteristics of the data for your task that your modeling method must account for.

I have used a time-series task to illustrate, with the Producer Price Index series as the example. Whether we want to forecast the series (supervised) or cluster it (unsupervised), it helps to first know what components the underlying series is made up of. I have used a time series decomposition method, Singular Spectrum Analysis (please see Nina Golyandina's fantastic book), to decompose the series into a trend component and an oscillatory component. There are other decomposition methods, such as STL and XTS (see Hyndman’s book for the details). The point is that the data profile gives these components. It tells me that, to forecast this series well, my modeling method must be able to capture both the type of trend behavior and the oscillatory behavior I see. Those familiar with the notion of additive models should be able to make an immediate connection.

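
To make the decomposition step concrete, here is a minimal sketch of basic Singular Spectrum Analysis in NumPy. It is not the code from the notebook for this recipe. The function name `ssa_components`, the window length of 36, the synthetic series (a stand-in for the Producer Price Index, which is not loaded here), and the grouping of components into trend, oscillation, and residual are all assumptions made for illustration.

```python
import numpy as np


def ssa_components(series, window):
    """Basic SSA: return one reconstructed series per singular value."""
    x = np.asarray(series, dtype=float)
    n = x.size
    k = n - window + 1
    # Embedding: lagged windows of the series become columns of the trajectory matrix.
    trajectory = np.column_stack([x[j:j + window] for j in range(k)])
    # Decomposition: SVD of the trajectory matrix.
    u, s, vt = np.linalg.svd(trajectory, full_matrices=False)
    components = np.empty((s.size, n))
    for i in range(s.size):
        elementary = s[i] * np.outer(u[:, i], vt[i])
        # Diagonal averaging: entry (r, c) of the matrix belongs to time r + c,
        # so averaging each anti-diagonal maps the matrix back to a series.
        flipped = elementary[::-1]
        components[i] = [flipped.diagonal(d).mean() for d in range(-window + 1, k)]
    return components, s


# Synthetic monthly-like series (assumed for illustration): trend + cycle + noise.
t = np.arange(120)
rng = np.random.default_rng(0)
series = 10 * np.exp(0.01 * t) + np.sin(2 * np.pi * t / 12) + rng.normal(0, 0.1, t.size)

components, singular_values = ssa_components(series, window=36)

# Grouping (an assumption that suits this synthetic series, not a general rule).
trend = components[0]
oscillation = components[1:3].sum(axis=0)
residual = components[3:].sum(axis=0)

print(np.round(singular_values[:5], 1))
```

The three groups add back up to the original series, which is the additive structure mentioned above. On real data the grouping is usually chosen by inspecting the singular values and the reconstructed components rather than fixed in advance.
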
The tool you use for profiling the data is, unfortunately, going to be domain- and application-dependent. The same kind of dependency holds for other activities in data science, such as feature engineering. You cannot use a time series decomposition method on cross-sectional data. For cross-sectional data, a CART tree can be used to divide your dataset into regions that have similar predictor-response behavior. Sourish and I have a paper on this (https://link.springer.com/article/10.1007/s41060-018-0146-6). You can then profile these regions independently to characterize the modeling features that are needed for each region of your dataset. There are probably other ways to do this for cross-sectional data; this is one method that I have used. I prefer a data profile along with tools for model explanation, rather than just having a tool for model explanation.
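
As a rough sketch of the cross-sectional idea, the snippet below fits a small regression tree with scikit-learn and then summarizes each leaf separately. It does not reproduce the procedure from the paper. The synthetic data, the tree settings (`max_leaf_nodes=4`, `min_samples_leaf=30`), and the choice of summary statistics are assumptions for illustration only.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic cross-sectional data (assumed): the response depends on x1
# differently on either side of x0 = 5.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 2))
y = np.where(X[:, 0] < 5, 2.0 * X[:, 1], 20.0 - 3.0 * X[:, 1])
y = y + rng.normal(0, 0.5, size=500)

# A shallow CART tree partitions the rows into a handful of regions (leaves).
tree = DecisionTreeRegressor(max_leaf_nodes=4, min_samples_leaf=30, random_state=0)
tree.fit(X, y)
region = tree.apply(X)  # leaf index for every row

# Profile each region independently: size, mean response, and how the
# response moves with x1 inside that region.
profile = []
for leaf in np.unique(region):
    mask = region == leaf
    profile.append({
        "leaf": int(leaf),
        "n_rows": int(mask.sum()),
        "mean_y": float(y[mask].mean()),
        "corr_x1_y": float(np.corrcoef(X[mask, 1], y[mask])[0, 1]),
    })

for row in profile:
    print(row)
```

Each entry in `profile` describes one region, so regions whose summaries differ sharply are a hint that the model has to handle them differently.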

For the data profile development, please see this notebook. For the actual data profile, please see this report.