-
Notifications
You must be signed in to change notification settings - Fork 108
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
repoID bitmap for speeding up findShard in compound shards #899
Conversation
Note: I still need to add unit tests |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems really useful to have this bitmap!
Overall question: when you saw a bunch of time spent in findShard
, do you know what exactly was slow? Was it reading the metadata, or going through all the repos to check if one existed?
The reason I ask is that I already optimized the metadata reading part a bit: #826. Before, we were reading all sections in the shard (!!) now we just load the metadata. I am curious if the large instance you saw had this fix, but still saw slowness, or could have been missing it.
toc.go
Outdated
@@ -187,6 +188,8 @@ func (t *indexTOC) sectionsTaggedList() []taggedSection { | |||
{"nameBloom", &unusedSimple}, | |||
{"contentBloom", &unusedSimple}, | |||
{"ranks", &unusedSimple}, | |||
|
|||
{"reposIDsBitmap", &t.reposIDsBitmap}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tiny comment, it'd be nice to group this with the in-use sections above so it's next to "repos"
build/builder.go
Outdated
// If we are still seeing performance issues, we should consider adding | ||
// some sort of global oracle here to avoid filepath.Glob and checking | ||
// each compound shard. | ||
if !zoekt.MaybeContainRepo(fn, o.RepositoryDescription.ID) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking out loud: what if this were not two separate steps (so callers must remember to check MaybeContainRepo
first), but a single new method like zoekt.ReadMetadataForRepo(fn, repoID)
? That might also let us share work between the two, and avoid opening and mmapping the index file twice (although it's likely not a big deal :))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that makes sense. I have refactored the code.
toc.go
Outdated
@@ -96,7 +96,8 @@ type indexTOC struct { | |||
contentChecksums simpleSection | |||
runeDocSections simpleSection | |||
|
|||
repos simpleSection | |||
repos simpleSection |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we bump the index FeatureVersion
or FormatVersion
here? (I'm not 100% on which one 😊 ) That way, far in the future once we've dropped support for older versions, we can know the bitmap is always there, and simplify the logic?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FeatureVersion
seems wrong, because we don't require a reindex.
Not sure about FormatVersion
either. The code is currently in an in-between state. We write and read v16 by default, but compound shards are v17. Introducing v18 now seems confusing.
I am also not sure the distinction between FeatureVersion
and FormatVersion
is really useful. I am always confused about the different versions. Maybe it's worth rethinking the model? 🤔.
I suggest to complete the move to v17 in a separate PR and used that to simplify the logic: We can re-index v16 shards if we encounter them but support reading them for a couple of releases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good to do it in a separate PR. Also let's catch up about FeatureVersion
/ FormatVersion
more generally, I'd like to understand how to simplify the model.
I'll get back to finishing this PR this week, its just not high priority so wanna prioritize some other work first. |
@keegancsmith let's revive this PR!! (Also sorry about the merge conflicts ...) |
We add a new section to shards which contains a roaring bitmap for quickly checking if a shard contains a repo ID. We then can load just this (small amount) of data to rule out a compound shard. We use roaring bitmaps since we already have that dependency in our codebase. The reason we speed up this operation is we found on a large instance which contained thousands of tiny repos we spent so much time in findShard that our indexing queue would always fall behind. It is possible this new section won't speed this up enough and we need some sort of global oracle (or in-memory cache in indexserver?). This is noted in the code for future travellers. Test Plan: the existing unit tests already cover if this is forwards and backwards compatible. Additionally I added some logging to zoekt to test if older version of shards still work correctly in findShard, as well as if older versions of zoekt can read the new shards. I haven't quantified the performance improvements yet. Before landing I will generate some synthetic large compound shards and test on my machine the changes in perf for findShard (will update description).
8d65282
to
04e94b3
Compare
{Name: "sameName", ID: 2}, | ||
{Name: "sameName", ID: 3}, | ||
{Name: "repoB", ID: 2}, | ||
{Name: "repoC", ID: 3}, | ||
}, | ||
{ | ||
{Name: "repoB", ID: 4}, | ||
{Name: "sameName", ID: 5}, | ||
{Name: "sameName", ID: 6}, | ||
{Name: "repoD", ID: 4}, | ||
{Name: "repoE", ID: 5}, | ||
{Name: "repoF", ID: 6}, | ||
}, | ||
}, | ||
expectedShardCount: 1, | ||
expectedRepository: zoekt.Repository{Name: "sameName", ID: 5}, | ||
expectedRepository: zoekt.Repository{Name: "something-else", ID: 5}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fly-by change to make the intent of the test more obvious. Before, the name and id both matched so it was not obvious that we match by id only.
// BenchmarkFindCompoundShard-16 33505 36016 ns/op | ||
// | ||
// Without optimization | ||
// BenchmarkFindCompoundShard-16 76 15568589 ns/op |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! I guess this answers my question here:
when you saw a bunch of time spent in findShard, do you know what exactly was slow? Was it reading the metadata, or going through all the repos to check if one existed?
index/read.go
Outdated
@@ -649,6 +653,103 @@ func IndexFilePaths(p string) ([]string, error) { | |||
return exist, nil | |||
} | |||
|
|||
// MaybeContainRepo returns true if the shard at path p could contain repoID. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Small comment: this should be maybeContainsRepo
. Also, it seems like the method we're encouraging callers to use is actually containsRepo
. So we could move the doc comment there instead (and rework it?)
index/builder_test.go
Outdated
|
||
shard := o.findCompoundShard() | ||
if shard != "" { | ||
b.Fatal("expected emtpy result") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
b.Fatal("expected emtpy result") | |
b.Fatal("expected empty result") |
We add a new section to shards which contains a roaring bitmap for quickly checking if a shard contains a repo ID. We then can load just this (small amount) of data to rule out a compound shard. We use roaring bitmaps since we already have that dependency in our codebase.
The reason we speed up this operation is we found on a large instance which contained thousands of tiny repos we spent so much time in findShard that our indexing queue would always fall behind.
It is possible this new section won't speed this up enough and we need some sort of global oracle (or in-memory cache in indexserver?). This is noted in the code for future travellers.
Test Plan: the existing unit tests already cover if this is forwards and backwards compatible. Additionally I added some logging to zoekt to test if older version of shards still work correctly in findShard, as well as if older versions of zoekt can read the new shards.
I haven't quantified the performance improvements yet. Before landing I will generate some synthetic large compound shards and test on my machine the changes in perf for findShard (will update description).Added a benchmark to check the impact. See comments in the code.
Closes https://linear.app/sourcegraph/issue/SPLF-824/zoekt-fast-detection-of-repo-id-in-shard