-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support load per-iteration replacement of NamedSPI #14275
base: main
Are you sure you want to change the base?
Conversation
Is it that this switches from "first one wins" to "last one wins"? |
Good q. Out of the box, the default remains unchanged - first one wins. But if overridden, allows subsequent providers to conditionally replace previous ones (in the same iteration). Does this help? Or have I missed your point. |
checkServiceName(name); | ||
services.put(name, service); | ||
// unless the later-found service allows to replace the previous. | ||
var prevNew = newServices.get(name); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is the current behaviour is the same service is registered multiple times? I guess only one will win, but it's not deterministic which one?
I was wondering if the replace default behaviour introduces a potential breaking behaviour for existing users. I guess though that is granted if we want to make things deterministic: there are things that may have worked almost by accident before this change, that now require overriding replace explicitly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is the current behaviour is the same service is registered multiple times?
First one wins. Subsequent ones are ignored. As it was prior to this proposed change.
... it's not deterministic which one?
For applications deployed on the classpath it is deterministic. Since the classpath is an ordered sequence of directories and jars, the order of service finding is: 1) the order of the classpath - earlier jars/dirs are encountered before later ones, and 2) the order of services defined in the META-INF/services
. So applications and services on the classpath are deterministic.
For applications deployed as modules (as Elasticsearch is), the module layer is searched for modules that provides
the required service. "The ordering of modules in same class loader, or the ordering of modules in a module layer, is not defined." [1]. This is where the non-determinism is coming from.
I was wondering if the replace default behaviour introduces a potential breaking behaviour for existing users.
By default, behaviour remains unchanged. The default value retuned by replace
is false
- "do not replace". Services that are provided by lucene modules/jar will not replace or override that of the user. And since this is new, so the replace
method returns false
, user services will not replace lucene services.
I guess though that is granted if we want to make things deterministic: there are things that may have worked almost by accident before this change, that now require overriding replace explicitly?
For modules you are exactly right. If it works today with modules, then it is a little fragile and non-deterministic.
Going forward, services provide by Lucene jars/modules should return replace
false
, that way users can programmatically select their service over the one provided by Lucene, irrespective of module or classpath ordering. However, classpath ordering will be affected by this, if a user provides a replace
true
. But this is exactly what we want.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for the detailed explanation!
@jpountz given the connection of this PR with completion FST, do you have opinions here? |
Anyone else ? @uschindler ? I think that this is quite solid, and while the issue we're facing in Elasticsearch is because we deploy as modules, it may not be that widely encountered, the changes here do improve the usage on the classpath too. |
Did you think about this one, too: https://github.com/apache/lucene/blob/8e68ed22614dc7841ebea94d3e66561ceb74d25e/lucene/core/src/java/org/apache/lucene/analysis/AnalysisSPILoader.java I think this is fine as it preserves previous behaviour. I still don't get how this improves Elasticsearch when it loads codecs, as existing behaviour does not change. |
Now I understand how this is expected to work. Elasticsearch will return true for the SPI impl. Maybe we should also try to allow different orders for the active discovery. Maybe we should replace the Boolean by an priority integer and allow prios to replace. In that case we could load all classes first, then sort them and then apply changes in order and keep highest priority. |
But basically the whole idea here is to allow to replace codecs, which is in reality not wanted at all. If you want a different codec, name it differently. So you am not fully happy with this. The SPI mechanism was never thought to replace existing impls at runtime or by ordering of classpath. I know we had this at beginning for testing purposes on the transition to codecs in Lucene 4, but it's an anti-pattern. |
Yeah, I can see your point. What unsettles me a little about the proposed change is the "weight" that it imposes on this simple SPI interface for a somewhat niche issue. That said, there is currently no way to even ensure that the Lucene provider is the one that gets found. For example, in all my experiments in Elasticsearch, the ES provider has always been found. But I can see from the code that it is not always guaranteed to be. If we deployed on the classpath, then I'd just ensure the order there. So if this is not the right API point here, I do think that it is worth doing something. For example, I did start to consider your previous suggestion - priority order, but need to rethink if there is something "lighter weight" that could be done. |
I will throw in a real usecase that gives us a bit of headache: completion fields. All the existing codecs load them on heap, and we want to make a switch to load them off heap in certain situations. We could do so with a new codec, but that would only affect newly indexed data, and that is somehow weird as in this specific case it's only a matter of how data gets loaded, nothing to do with how it gets stored. Looks like loading off heap for existing data would only be possible via horrible hacks, if at all? |
Hi,
I full agree that this is a problem. But in general that should NOT be linked to the SPI rsolving. SPI is only there to load the correct codec to actually decode the index format. The decission of that should be done on heap or off heap is orthogonal. Maybe it is an XY problem: If Elasticserach wants to exchange a codec and replace it by an own implementation that loads stuff on on/off-heap or uses native code, this could be a requirement we have to decide. With SPI alone, this won't work (unless you write some extra codec version into the index files). ES could do this, but it would make those indexes harder to maintain. I think we should differentiate:
How to make those configuration settings? One solution previously used in Lucene was to have codec parameters while writing (by instantiating custom versions of the codec with different settings but same name) - e.g., stored fields used this to enable/disable compression. To read, the only way to do this was system properties or static setters. I don't like that. My proposal would be: Let's add some key-value pairs of "codec options" like done in Analyzers, that can be passed as part of the IndexWriterConfig (while writing) or passed to DirectoryReader (as IndexReaderConfig => just a plain map). Maybe the keys should be standardized to take the codec name, followed by "." and then a key name. This map is taken by DirectoryReader.open() and saved as instance field in DirectoryReader and passed to each codec when requesting instances of fields, vectors, stored fields,... The codec can lookup then ask for the property by their codec name and key and apply it to its configuration. This could be used straigth already to configure encoding or compression of stored fields (while writing) or if the FST should be loaded on heap or off-heap. Both need no different index format, it just selects options how to handle encoding or decoding. |
Absolutely, we had similar ideas too - so not totally crazy! ;-) . This would be a far better solution to the underlying requirement, than the SPI replacement that I suggested in this PR. |
I agree with everything you wrote above @uschindler ! I can try and update my existing PR targeted at suggestion fields (#14270), following your suggested approach. The PR currently has a new static setter, and I also don't like that at all :) |
Sorry for the late reply. If we want to allow configuring how a codec gets loaded at read time, I agree with @uschindler 's suggestion of doing it via options passed to But I'm still questioning if there's actually a use-case for allowing something to be loaded either on-heap or off-heap in our codecs. For all examples that come to mind, I would rather like Lucene to make the call than give users the option. In the case of the completion postings format, maybe we should just always load the FST off-heap? |
I see where you are going, this would be pragmatic, but I wonder next what happens if anyone would like to switch that back to on-heap, I don't know how common that would be, but I still feel like it is odd to require maintaining a custom codec only to affect the loading. I guess I need to see how much complexity the solution we discussed above requires adding, because that matters too. |
I do think that this is more generally useful, than just the particular use case of on or off -heap FST in completion postings.
I really like this idea too, and think that it's worth exploring. |
I'm curious of what other use-cases you have in mind. For reference, we used to allow configuring whether to load the FST on or off-heap for the terms index before we removed the option in Lucene 8.5 9733643#diff-53b0b5ab08e5c5c74639ced26c691039d2cf91dd2f5e4b3e065eaa1a1e6a9108. Before that, there were concerns that doc values (off-heap) would be slower than FieldCache (on-heap) but nowadays, nobody would suggest going back to FieldCache or loading doc values on heap, even just as an opt-in. Likewise, even though page faults are challenging with KNN vectors, nobody is suggesting that they should be loaded on-heap rather than off-heap. Why do we think that completion fields are different and should support being loaded on-heap? |
I opened #14364 to make the suggested change to the completion postings format, let me know what you think. |
I've been thinking a bit more about this. The two potential use-cases that come to mind are the following:
As discussed above, I want codecs to own the way memory is managed, so I don't like giving options to customize it. Regarding native optimizations, Lucene core is unlikely to ever include native extensions, so I can see an argument for it given the number of limitations of the vector API that we already found. But then I don't think that codecs are the right extension point for this sort of things, they are too high-level and would force re-implementing (copy-pasting in practice) the whole read-side logic, just optimizing the relevant bits. I would rather like it to look like |
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution! |
This commit adds support load per-iteration replacement of NamedSPI. The primary motivation for this change is to support deterministic SPI loading when deploying Lucene as a module.
When deploying as on the class path the service loading follows the order of the jars on the classpath. When deploying as modules, services in modules are found before those in unnamed modules, but the order of services within a module layer is undefined.
The API change proposed,
NamesSPI::replace
, allows subsequent service providers found to optionally replace ones found during the same iteration. This is sufficient when deploying just a couple of service provider implementation, or to effectively replace one provided by Lucene itself, e.g. elastic/elasticsearch#123011In fact, we think that this may be a sufficient model to provide customisation to certain codecs, just like in
#123011
(where we changed CompletionPostingFormats to use off-heap FST load mode ).