Question on bulk queries #75

Closed · vikmik opened this issue Apr 14, 2021 · 8 comments

vikmik commented Apr 14, 2021

Hello!

(not sure if this is the right place to ask questions - feel free to close if not)
Going over online resources on Scylla performance, I keep seeing a recurring theme:

  • it may be beneficial to use bulk queries (SELECT ... IN for reads, BATCH for writes) instead of many parallel queries;
  • when doing bulk queries, it is better to avoid coordination across multiple nodes (i.e., avoid bulk queries that require data from different nodes).

Example resources that touch on this:

Now, the application I'm working on needs to do large bulk reads and writes, so I want to make sure I'm not adding artificial bottlenecks.

With that in mind, I'm a bit puzzled about how to implement the above recommendations with gocql. Ideally the driver would let me split my bulk queries by node, but there does not seem to be a way to do so.
The closest I can get with gocql is to have my application split bulk queries by partition key. But if I'm trying to read, say, 200K rows that all have different partition keys, my bulk queries end up split into 200K single-row queries. In practice that means I can't use bulk queries at all, which may hurt my application's performance.
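
For reference, the fallback I'm describing (splitting a bulk read into many single-partition queries with bounded concurrency) looks roughly like the sketch below. The keyspace, table, and column names (`ks.tbl`, `pk`, `value`) and the concurrency limit are made up for the example:

```go
package bulk

import (
	"context"

	"github.com/gocql/gocql"
	"golang.org/x/sync/errgroup"
)

// readKeys reads one row per partition key, issuing single-partition
// queries concurrently instead of one bulk SELECT ... IN.
// Table and column names (ks.tbl, pk, value) are hypothetical.
func readKeys(ctx context.Context, session *gocql.Session, keys []string) (map[string]string, error) {
	results := make([]string, len(keys))
	g, ctx := errgroup.WithContext(ctx)
	g.SetLimit(256) // bound the number of in-flight queries

	for i, key := range keys {
		i, key := i, key // capture loop variables (needed before Go 1.22)
		g.Go(func() error {
			return session.Query(`SELECT value FROM ks.tbl WHERE pk = ?`, key).
				WithContext(ctx).Scan(&results[i])
		})
	}
	if err := g.Wait(); err != nil {
		return nil, err // first failing key wins; remaining queries are canceled
	}

	out := make(map[string]string, len(keys))
	for i, key := range keys {
		out[key] = results[i]
	}
	return out, nil
}
```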

Given that the driver has access to the information that would be required to split queries intelligently, I was wondering why there is no such facility. I'm not sure whether I'm misunderstanding how best to use the driver, or whether the driver is currently missing the feature. It would admittedly require the driver to expose a rather odd-looking API (the current gocql API is quite abstracted away from such concerns), but that seems necessary to make bulk queries as efficient as they can be.

I'm completely new to the Scylla ecosystem, so apologies if my question seems odd. Advice welcome, and sorry if I missed something obvious :)

Thanks!

@martin-sucha

Yep, I'm not aware of any public API to determine which node (or, in the case of Scylla, which shard) a partition key maps to.

We just split our queries by partition key, although in our case we don't need to read 200K specific partition keys at once, and we read more than one row per partition.

One advantage of splitting by partition key is that each partition can return an error separately (e.g. when one partition takes a long time to process), though that might be irrelevant if you only have one row per partition.

It would definitely be interesting to see a benchmark comparing the options.

@Codebreaker101

I too would like the ability to batch my data according to which node/shard it is going to. Any news on this?


moguchev commented Mar 18, 2024

I have the same problem. I extended the gocql driver with a function that helps me split keys into shard-aware batches. Maybe it will be helpful:
#164
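
For context, the general shape of such an extension (grouping keys by replica before batching) might look like the sketch below. The `replicaFor` helper is hypothetical; it stands in for whatever token/replica lookup the extension exposes, since stock gocql has no public API for this, which is the point of this issue:

```go
package bulk

import "github.com/gocql/gocql"

// insertGrouped groups rows by replica and sends one unlogged batch per
// replica. replicaFor is a hypothetical stand-in for a token/replica
// lookup; gocql exposes no public API for it.
func insertGrouped(session *gocql.Session, rows map[string]string,
	replicaFor func(pk string) string) error {

	byReplica := make(map[string][]string)
	for pk := range rows {
		host := replicaFor(pk)
		byReplica[host] = append(byReplica[host], pk)
	}

	for _, pks := range byReplica {
		batch := session.NewBatch(gocql.UnloggedBatch)
		for _, pk := range pks {
			batch.Query(`INSERT INTO ks.tbl (pk, value) VALUES (?, ?)`, pk, rows[pk])
		}
		if err := session.ExecuteBatch(batch); err != nil {
			return err
		}
	}
	return nil
}
```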

vikmik (author) commented Mar 18, 2024

Did you benchmark your approach? For what it's worth, I tested my application with similar changes and found that in my case they didn't help. I stopped using SELECT ... IN and BATCH entirely. ScyllaDB is so good at handling concurrent queries that I found no compelling reason to extend the driver code and then have to maintain it.

I suspect that ScyllaDB is less efficient at load-balancing queries that vary in size than it is at handling many concurrent queries. That was 3 years ago, though.

@moguchev

It really depends on the kind of load. ScyllaDB is really good at handling concurrent queries, but when we have 1.5M rps with 100-500 different keys (from different partitions) per insert or read query, concurrent single-key queries are not efficient.

We did several experiments and found that ScyllaDB can execute big BATCHes efficiently if all keys in the batch correspond to one host (so the strategy is host-aware), and that SELECT ... IN queries are efficient when they contain at most 10 keys corresponding to one shard (so the strategy is shard-aware). That's why I wrote this extension API and use it in our application.
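
To illustrate the 10-keys limit (leaving out the shard grouping, which needs the extension API), a minimal chunked SELECT ... IN could look like this; the keyspace, table, and column names are hypothetical:

```go
package bulk

import (
	"fmt"
	"strings"

	"github.com/gocql/gocql"
)

// selectInChunks issues SELECT ... IN with at most 10 keys per query.
// A shard-aware version would first bucket the keys per shard; that
// grouping step is omitted here.
func selectInChunks(session *gocql.Session, keys []string) (map[string]string, error) {
	const chunkSize = 10
	out := make(map[string]string, len(keys))

	for start := 0; start < len(keys); start += chunkSize {
		end := start + chunkSize
		if end > len(keys) {
			end = len(keys)
		}
		chunk := keys[start:end]

		// Build "?, ?, ..." placeholders and the matching arguments.
		marks := make([]string, len(chunk))
		args := make([]interface{}, len(chunk))
		for i, k := range chunk {
			marks[i] = "?"
			args[i] = k
		}
		stmt := fmt.Sprintf(`SELECT pk, value FROM ks.tbl WHERE pk IN (%s)`,
			strings.Join(marks, ", "))

		iter := session.Query(stmt, args...).Iter()
		var pk, value string
		for iter.Scan(&pk, &value) {
			out[pk] = value
		}
		if err := iter.Close(); err != nil {
			return nil, err
		}
	}
	return out, nil
}
```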

You should also understand that ScyllaDB is really good at parallelism, so it's important to handle queries concurrently in your application in order to ensure maximum utilization of ScyllaDB.


mykaul commented Apr 15, 2024

@moguchev - how do you handle individual query failures?

@moguchev

For insert queries we have a background retry queue, so we hope that failed queries will eventually be retried successfully (and of course we monitor such queries); see the sketch below.

For read queries served in real time, there is nothing we can do about individual failed queries, so we degrade functionality and notify clients that they can re-request some values. This is a trade-off; in some applications it cannot be done. We are lucky that we can afford this strategy.
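
A minimal sketch of such a background retry queue for inserts, assuming a simple channel-based design (the actual implementation is not shown in this thread):

```go
package bulk

import (
	"time"

	"github.com/gocql/gocql"
)

// retryItem is one failed insert waiting to be retried.
type retryItem struct {
	stmt string
	args []interface{}
}

// startRetryWorker drains a queue of failed inserts in the background.
// A real implementation would add backoff, a retry cap, and metrics so
// the failures stay monitored.
func startRetryWorker(session *gocql.Session, queue chan retryItem) {
	go func() {
		for item := range queue {
			time.Sleep(100 * time.Millisecond) // crude pacing between retries
			if err := session.Query(item.stmt, item.args...).Exec(); err != nil {
				select {
				case queue <- item: // re-enqueue and try again later
				default: // queue full: drop and rely on monitoring/alerts
				}
			}
		}
	}()
}
```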

@dkropachev (Collaborator)

I would suggest closing this issue in favor of #200.

@sylwiaszunejko closed this as not planned on Jul 22, 2024