Using an SQL data source (Design Philosophy?) #679
-
Hi all! I've been pondering how to design my River implementation and I've settled on using a time series database as my stream. It also occurred to me that I could store things like predictions, metrics, etc. in separate tables, so that I can analyze performance between runs with different parameters, models, etc. to see what performs best on the data I'm working with. Therefore, this is more of a philosophical question than a technical one. What are some pros and cons of integrating River with a database backend for storing the data stream and the output of the algorithm? Has anyone in the community here done this before? If so, what worked and what didn't?
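For context, here's a minimal sketch of what I have in mind: streaming rows out of an SQL table one at a time, in the `(x, y)` dict format River models consume. The table name and columns are made up, and a running mean stands in for an actual River model so the snippet has no dependencies beyond the standard library.

```python
# Hypothetical schema: a "stream" table with a timestamp, features, and a target.
# A running mean stands in for model.predict_one / model.learn_one.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stream (ts INTEGER, f1 REAL, f2 REAL, y REAL)")
conn.executemany(
    "INSERT INTO stream VALUES (?, ?, ?, ?)",
    [(1, 0.5, 1.0, 2.0), (2, 0.6, 0.9, 2.1), (3, 0.4, 1.1, 1.9)],
)
conn.row_factory = sqlite3.Row

def iter_sql(conn):
    # Ordering by arrival time is what makes the replay faithful.
    for row in conn.execute("SELECT * FROM stream ORDER BY ts"):
        x = {k: row[k] for k in row.keys() if k not in ("ts", "y")}
        yield x, row["y"]

n, total = 0, 0.0
preds = []
for x, y in iter_sql(conn):
    preds.append(total / n if n else 0.0)  # predict before learning
    n, total = n + 1, total + y            # then let the "model" see y

print(preds)  # [0.0, 2.0, 2.05]
```

With River installed, the loop body would just become `model.predict_one(x)` followed by `model.learn_one(x, y)`.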
-
I think there is no established way to do things. Whatever works for you is fine.

In my opinion, what is really important is to have a proper dataset to train and evaluate on. I say train and evaluate because, as you may know, a single dataset can be used for both tasks when you do online learning. It's what we call progressive validation. The ideal setup is therefore to have a dataset where you know the arrival times of the labels. This way, you can simulate a production scenario by showing the x's and y's to the model in the exact same order as what happened in production. I wrote a blog post on this here.

Now, the way you obtain this dataset is entirely up to you. Using a database is fine. Writing your predictions and metrics to a database is fine too. In my experience, data scientists don't use databases much for prototyping and instead go for files. I would say using a database is cleaner, but it's really a matter of taste and habit. Again, what's really important is this idea of having a good dataset to train and evaluate on.

In production, you might not be using a relational database, but instead a message queue. See this great article for more information.

Note that at some point I would like to develop a tool on top of River to assist in developing and productionalising online models. At the moment, we're lacking patterns and tooling that we can all align on.