-
-
Notifications
You must be signed in to change notification settings - Fork 55
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Preload xapian database #958
base: main
Are you sure you want to change the base?
Conversation
This struct will help store Xapian's metadata.
We don't need it. No need to store it.
Note this commit introduce a race condition. As we now open only on xapian db per archive, we may have different searchers using the same xapian db in the same time.
We will need to lock all db mutex in the same time when doing multizim search. And we must avoid dead locks in the process. Std lib provides us a `std::lock()` [1] function to lock several mutex in the same time. However, it is a variatic function and it doesn't take a list of mutex to lock. So we havet to implemented ourselves. But we don't want to re implement an error prone algorithm. To do so, we introduce a `MultiMutex` (a linked list) and we (ab)use the fact that `std::lock()` is a template taking any object "Lockable" [2] [1] https://en.cppreference.com/w/cpp/thread/lock [2] https://en.cppreference.com/w/cpp/named_req/Lockable
99506fd
to
da0ad2f
Compare
You did not mentioned it, but did you ran benchmarking ( I think it would be important to make same measurements on a Raspberry Pi 3 (or even a Raspberry Pi Zero W) with the ZIM on an SD card. I imagine numbers will be significantly different, and the benefit of preloading Xapian database could be more visible since this is kinda the "worst case scenario" (at least it is the scenario of original issue where we had bad numbers in Cold vs Hot cases). Maybe testing on a real Android device (e.g. low-end tablet) would be fun as well (or not ^^ I imagine you don't have sufficient tooling for this just like I do). |
c23aca7
to
cea50cf
Compare
We finally lock all xapian Db mutexes in the same time before when access them.
cea50cf
to
d2e4ae7
Compare
Indeed. I have a pi3, I will run the same test on it. And I confirm I have no android device, I can send you the binaries if you have one and want to test. |
I have no idea about how to run a binary on an Android device. Is it just a matter of starting an adb shell to the device and starting the binary? If so (or something straightforward like this), then yes I can happily help with this. |
Here the numbers on a Pi3:
While the numbers are obviously greater, the same schema appears. Opening the xapian db is about 131ms, searching (xapian only) is about 850ms. So we spend around 1s in xapian and 9s in zim itself. |
Yes, it should be as simple as it sound. |
Please send me the binary and the ZIM file then, I will give it a try (mostly for curiosity, since numbers on Pi3 seems to speak for themselves) |
I cannot because of this : kiwix/kiwix-build#808 |
@veloman-yunkan Can you please do the necessary to fix kiwix-build? |
@kelson42 I will |
@mgautierfr Does the documentation (readthedocs) has been changed? It is not very clear to me right now how to use this feature and what has changed in term of interface for the developer. |
It will be once merged. readthedocs documentation is generated from comments in source code. You can see the rtd for the pr at : https://libzim--958.org.readthedocs.build/en/958/ Saying that, I've just realize that the doc is generated as if libzim was not compiled with xapian. So some methods are not present (especially
User have to call |
@mgautierfr From what I see there, the preloading of the Xapian indexes is not and option to specify at ZIM archive. |
Yes, because of this :
|
|
@mgautierfr preloading xapian index has to be an option of the opening primitives, not a dedicated public API. |
d2e4ae7
to
a393cae
Compare
Now m_direntLookup is created at fileimpl creation, not at first (user) access to the dirents. This is kind of useless to delay the dirent_lookup creation until the first access as user will need to create it to use the archive. For the rare case when this is not needed (search only, iter efficient), user can now simply deactivate the dirent lookup.
72276cf
to
49e9136
Compare
On Windows, static member are not exported in dlls. So we don't have a value to use as default for preloadConfig.
Fix #617 (preloading xapian database)
Numbers do not show a strong win (not a lost neither) but this PR also introduce nice improvement as side effect and so is a good preparatory work for next project about multi zim and multi lang issues.
Globally this PR does:
XapianDb
is about a specific database stored in a zim file.InternalDataBase
is about a database we search in (which "include"XapianDb
s)Archive::preloadXapianDb
to explicitly load the xapian db.The numbers
Globally, what is taking time is loading the data from fs into memory. And this is the purpose of this PR to load this data upfront.
So there is two use case to evaluate: a cold one (data is not loading) and warm one (data is loaded).
On top of that, locating the xapian db or using the search results imply loading entries from zim file to get the title. Accessing zim dirent will activate our dirent lookup cache and we will need to populate it (and we know it can take some time, that was the purpose of #950 to be able to deactivate it)
Numbers are obtained by instrumenting zimsearch and running
zimsearch wikispecies_en_all_maxi_2020-06.zim home
(and taking the mean of 3 runs).Numbers are in milliseconds. Main number is cold. Number is parentheses is warm.
Search (w getTitle)
is the time to do the search and use it. It include the time to get the title (from zim dirents) of all results (getTitle
).Search (wo getTitle)
is simply the difference between both, this is the real time taken by xapian to do the search.From
Preloading
table, we can see that loading the xapian DB itself is pretty fast, only 2ms.When dirent lookup is activated, it take 155ms but this is mostly the time to populate the dirent lookup cache. We found those numbers in
Create search
(no preload) as we load the xapian db (and populate the dirent lookup cache) when we start the search.The search performance itself is not really changed. Warm run is a bit faster but not a game changer.
Exploiting the search result (
getTitle
) also shows mitigated results:For the
Total
, preloading and dirent lookup cache also shows strange results (preloading should not impact the total, dirent lookup cache should have the same impact, what ever preloading is)At the end, it seem that most of the time spend during a search is about loading the dirent from zim file (either at dirent lookup population or at get title), not from xapian.
As said in the introduction, this PR is still interresting for its side effects: