Question on bulk queries #75
Comments
Yep, I'm not aware of any public API to determine which node (or, in Scylla's case, which shard) a partition key maps to. We just split our queries by partition key, although in our case we don't need to read 200K specific partition keys at once, and we read more than one row per partition. One advantage of splitting by partition key is that each partition can fail separately (e.g. when some partition takes long to process), though that might be irrelevant if you only have one row per partition. It would definitely be interesting to see a benchmark comparing the options.
I too would like to have the ability to batch my data according to which node/shard my data is going. Any new information regarding this?
Same problem here. I extended the gocql driver with a function that helps me split keys into shard-aware batches.
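For illustration, a minimal sketch of what such a grouping helper could look like. The `shardOf` function is a hypothetical stand-in: upstream gocql does not export the Murmur3 partitioner or the token-ring metadata, so the placeholder hash below only mimics the shape of the real thing.

```go
package main

import "fmt"

// shardOf is a hypothetical stand-in for a function that maps a partition
// key to the node/shard that owns it. A real implementation would compute
// the Murmur3 token and consult the cluster's token ring, which upstream
// gocql does not expose publicly (that is what this issue is about). The
// FNV-1a hash below is only a placeholder so the sketch runs.
func shardOf(partitionKey []byte, shardCount int) int {
	h := uint32(2166136261)
	for _, b := range partitionKey {
		h = (h ^ uint32(b)) * 16777619
	}
	return int(h % uint32(shardCount))
}

// groupByShard splits partition keys into shard-aware groups so that each
// group can later be sent as one batch/IN query to the owning node.
func groupByShard(keys [][]byte, shardCount int) map[int][][]byte {
	groups := make(map[int][][]byte, shardCount)
	for _, k := range keys {
		s := shardOf(k, shardCount)
		groups[s] = append(groups[s], k)
	}
	return groups
}

func main() {
	keys := [][]byte{[]byte("a"), []byte("b"), []byte("c"), []byte("d")}
	for shard, group := range groupByShard(keys, 2) {
		fmt.Printf("shard %d -> %d keys\n", shard, len(group))
	}
}
```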
Did you benchmark your approach? For what it's worth, I tested my application with similar changes and found that in my case it didn't help, so I stopped using it. I suspect that ScyllaDB is less efficient at load balancing queries that vary in size than it is at handling many concurrent queries. That was 3 years ago, though.
It really depends on the kind of load. ScyllaDB is really good at handling concurrent queries, but when we have 1.5M rps with 100-500 different keys (from different partitions) per insert or read query, concurrent single-key queries are not efficient. We did several experiments and found that ScyllaDB can insert big batches efficiently. You should also understand that ScyllaDB is really good at parallelism, so it's important to handle queries concurrently in your application in order to ensure maximum utilization of ScyllaDB.
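A minimal sketch of that pattern (one unlogged batch per key group, several groups in flight concurrently), using gocql's batch API; the kv(pk text PRIMARY KEY, v text) schema and the concurrency limit of 32 are illustrative assumptions:

```go
package bulk

import (
	"log"
	"sync"

	"github.com/gocql/gocql"
)

// insertGroups writes each group of rows as one unlogged batch and keeps
// several batches in flight at once, which is what keeps all Scylla shards
// busy. The kv(pk text PRIMARY KEY, v text) schema and the concurrency
// limit of 32 are illustrative assumptions.
func insertGroups(session *gocql.Session, groups map[int][][2]string) error {
	sem := make(chan struct{}, 32) // bound the number of in-flight batches
	errs := make(chan error, len(groups))
	var wg sync.WaitGroup

	for _, rows := range groups {
		wg.Add(1)
		sem <- struct{}{}
		go func(rows [][2]string) {
			defer wg.Done()
			defer func() { <-sem }()
			batch := session.NewBatch(gocql.UnloggedBatch)
			for _, r := range rows {
				batch.Query(`INSERT INTO kv (pk, v) VALUES (?, ?)`, r[0], r[1])
			}
			// Each group succeeds or fails independently of the others.
			if err := session.ExecuteBatch(batch); err != nil {
				errs <- err
			}
		}(rows)
	}
	wg.Wait()
	close(errs)

	var firstErr error
	for err := range errs {
		log.Printf("batch failed: %v", err) // e.g. hand off to a retry queue
		if firstErr == nil {
			firstErr = err
		}
	}
	return firstErr
}
```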
@moguchev - how do you handle individual query failures?
For insert queries we have a background retry queue, so we hope that some failed queries will be retried successfully (and of course we monitor such queries). For real-time read queries we can do nothing about individual failures, so we degrade functionality and notify clients that they can re-request some values. This is a trade-off; in some applications it cannot be done. We are lucky that we can afford this strategy.
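A minimal sketch of such a background retry queue, with a hypothetical `writeRow` standing in for the actual insert; real code would add exponential backoff, an attempt cap, and metrics on queue depth:

```go
package bulk

import (
	"log"
	"time"
)

// retryItem is a failed write waiting to be re-attempted.
type retryItem struct {
	pk, v    string
	attempts int
}

// retryLoop drains a queue of failed writes in the background. writeRow is
// a hypothetical stand-in for the actual insert; real code would add
// exponential backoff, an attempt cap, and metrics on queue depth.
func retryLoop(queue chan retryItem, writeRow func(pk, v string) error) {
	for item := range queue {
		if err := writeRow(item.pk, item.v); err != nil {
			item.attempts++
			log.Printf("retry %d failed for %s: %v", item.attempts, item.pk, err)
			time.Sleep(time.Second) // crude fixed backoff
			select {
			case queue <- item: // re-enqueue for another attempt
			default: // queue full: drop and rely on monitoring/alerting
				log.Printf("retry queue full, dropping %s", item.pk)
			}
		}
	}
}
```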
I would suggest closing this issue in favor of #200.
Hello!
(not sure if this is the right place to ask questions - feel free to close if not)
Going over some online resources on Scylla performance, there's a recurring theme, which is: use bulk queries (`SELECT ... IN` on reads and `BATCH` on writes) instead of doing more parallel queries.

Example resources that touch on this:

- `SELECT ... IN` queries: https://www.scylladb.com/tech-talk/planning-queries-maximum-performance-scylla-summit-2017/ (a bit old, but I think it is still relevant?)
- `BATCH` insert queries: https://www.scylladb.com/2019/03/27/best-practices-for-scylla-applications/

Now, the application I'm working on needs to do large bulk reads and writes, so I want to make sure I'm not adding artificial bottlenecks.
With that in mind, I'm a bit puzzled about how to implement the above recommendations with `gocql`. Ideally the driver would let me split my bulk queries by node, but there does not seem to be a way to do so (?)

It seems the closest I can do using `gocql` is to have my application split bulk queries by partition key. But if I'm trying to read, let's say, 200K rows with all different partition keys, this means my bulk queries will all be split into 200K single-row queries - which basically means I can't use bulk queries in practice - and that may be detrimental to my application's performance.

Given that the driver has access to the information that would be required to split queries intelligently, I was wondering why there is no such facility? I'm not sure whether I am misunderstanding how to best use the driver, or if the driver is currently lacking the feature. It would surely require the driver to expose a rather odd-looking API (the current gocql API seems pretty abstracted away from such concerns) - but that seems required in order to make bulk queries as efficient as they can be.
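For reference, here is roughly what a per-group bulk read would look like once keys are grouped by whatever criterion the driver or the application can manage; a minimal sketch, assuming a hypothetical kv(pk text PRIMARY KEY, v text) table (gocql allows binding a slice to `IN ?`):

```go
package bulk

import "github.com/gocql/gocql"

// readGroup fetches one group of partition keys with a single IN query;
// gocql allows binding a slice to `IN ?`. The kv(pk text PRIMARY KEY,
// v text) schema is an illustrative assumption.
func readGroup(session *gocql.Session, keys []string) (map[string]string, error) {
	out := make(map[string]string, len(keys))
	iter := session.Query(`SELECT pk, v FROM kv WHERE pk IN ?`, keys).Iter()
	var pk, v string
	for iter.Scan(&pk, &v) {
		out[pk] = v
	}
	return out, iter.Close()
}
```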
I'm completely new to the Scylla ecosystem so apologies if my question seems odd. Advice welcome, and sorry if I missed something obvious :)
Thanks!