Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Preload xapian database #958

Open
wants to merge 15 commits into
base: main
Choose a base branch
from
Open

Preload xapian database #958

wants to merge 15 commits into from

Conversation

mgautierfr
Copy link
Collaborator

@mgautierfr mgautierfr commented Feb 28, 2025

Fix #617 (preloading xapian database)

Numbers do not show a strong win (not a lost neither) but this PR also introduce nice improvement as side effect and so is a good preparatory work for next project about multi zim and multi lang issues.

Globally this PR does:

  • Move xapian metadata out of internalDataBase
  • Separate xapian database (and metadata) from internalDataBase. XapianDb is about a specific database stored in a zim file. InternalDataBase is about a database we search in (which "include" XapianDbs)
  • Move opening of xapian database in fileImpl (on the archive side instead of on the search side)
  • Open only on xapian db per archive. This as the side effect that searching is now thread safe. (We introduce mutex lock system to do so)
  • We introduce a new public method Archive::preloadXapianDb to explicitly load the xapian db.

The numbers

Globally, what is taking time is loading the data from fs into memory. And this is the purpose of this PR to load this data upfront.
So there is two use case to evaluate: a cold one (data is not loading) and warm one (data is loaded).
On top of that, locating the xapian db or using the search results imply loading entries from zim file to get the title. Accessing zim dirent will activate our dirent lookup cache and we will need to populate it (and we know it can take some time, that was the purpose of #950 to be able to deactivate it)

Numbers are obtained by instrumenting zimsearch and running zimsearch wikispecies_en_all_maxi_2020-06.zim home (and taking the mean of 3 runs).
Numbers are in milliseconds. Main number is cold. Number is parentheses is warm.
Search (w getTitle) is the time to do the search and use it. It include the time to get the title (from zim dirents) of all results (getTitle). Search (wo getTitle) is simply the difference between both, this is the real time taken by xapian to do the search.

Opening no preload preload  
no cache 13 (1) 13 (1)
cache 13 (1) 13 (1)
Preloading no preload preload  
no cache 0 2 (0.6)
cache 0 155 (8)
Create search no preload preload  
no cache 2 (0.7) 0.1 (0.1)
cache 170 (10) 0.1 (0.1)
Search (w getTitle) no preload preload  
no cache 224 (21) 250 (20)
cache 147 (19) 107 (16)
getTitle no preload preload  
no cache 209 (9) 235 (9)
cache 132 (9) 94 (8)
Search (wo getTitle) no preload preload  
no cache 15 (11) 14 (11)
cache 15 (10) 13 (8)
Total no preload preload  
no cache 240 (22) 265 (22)
cache 330 (31) 276 (25)

From Preloading table, we can see that loading the xapian DB itself is pretty fast, only 2ms.
When dirent lookup is activated, it take 155ms but this is mostly the time to populate the dirent lookup cache. We found those numbers in Create search (no preload) as we load the xapian db (and populate the dirent lookup cache) when we start the search.

The search performance itself is not really changed. Warm run is a bit faster but not a game changer.

Exploiting the search result (getTitle) also shows mitigated results:

  • dirent lookup improve dirent search
  • preload of xapian db has different impact depending of dirent lookup activated or not. However, it should have no impact at all.

For the Total, preloading and dirent lookup cache also shows strange results (preloading should not impact the total, dirent lookup cache should have the same impact, what ever preloading is)

At the end, it seem that most of the time spend during a search is about loading the dirent from zim file (either at dirent lookup population or at get title), not from xapian.
As said in the introduction, this PR is still interresting for its side effects:

  • Thread safe search
  • Only one xapian db open per zim file

This struct will help store Xapian's metadata.
We don't need it. No need to store it.
Note this commit introduce a race condition.

As we now open only on xapian db per archive, we may have different
searchers using the same xapian db in the same time.
We will need to lock all db mutex in the same time when doing multizim
search. And we must avoid dead locks in the process.

Std lib provides us a `std::lock()` [1] function to lock several mutex
in the same time. However, it is a variatic function and it doesn't take
a list of mutex to lock.

So we havet to implemented ourselves. But we don't want to re implement
an error prone algorithm. To do so, we introduce a `MultiMutex` (a
linked list) and we (ab)use the fact that `std::lock()` is a template
taking any object "Lockable" [2]

[1] https://en.cppreference.com/w/cpp/thread/lock
[2] https://en.cppreference.com/w/cpp/named_req/Lockable
@benoit74
Copy link

You did not mentioned it, but did you ran benchmarking (The numbers) on your beefy dev machine?

I think it would be important to make same measurements on a Raspberry Pi 3 (or even a Raspberry Pi Zero W) with the ZIM on an SD card. I imagine numbers will be significantly different, and the benefit of preloading Xapian database could be more visible since this is kinda the "worst case scenario" (at least it is the scenario of original issue where we had bad numbers in Cold vs Hot cases). Maybe testing on a real Android device (e.g. low-end tablet) would be fun as well (or not ^^ I imagine you don't have sufficient tooling for this just like I do).

We finally lock all xapian Db mutexes in the same time before when
access them.
@mgautierfr
Copy link
Collaborator Author

Indeed. I have a pi3, I will run the same test on it. And I confirm I have no android device, I can send you the binaries if you have one and want to test.

@benoit74
Copy link

I can send you the binaries if you have one and want to test

I have no idea about how to run a binary on an Android device. Is it just a matter of starting an adb shell to the device and starting the binary? If so (or something straightforward like this), then yes I can happily help with this.

@mgautierfr
Copy link
Collaborator Author

mgautierfr commented Feb 28, 2025

Here the numbers on a Pi3:

Opening no preload preload  
no cache 2806 (299) 2800 (299)
cache 2866 (299) 2810 (298)
Preloading no preload preload  
no cache 0 131 (4)
cache 0 1206 (25)
Create search no preload preload  
no cache 145 (4) 6 (0.1)
cache 1229 (25) 6 (0.1)
Search (w getTitle) no preload preload  
no cache 7801 (368) 7754 (368)
cache 6602 (352) 6549 (364)
getTitle no preload preload  
no cache 7015 (176) 6972 (176)
cache 5597 (161) 5578 (172)
Search (wo getTitle) no preload preload  
no cache 787 (191) 781 (191)
cache 1005 (191) 880 (192)
Total no preload preload  
no cache 10752 (672) 10693 (671)
cache 10698 (676) 10482 (688)

While the numbers are obviously greater, the same schema appears. Opening the xapian db is about 131ms, searching (xapian only) is about 850ms. So we spend around 1s in xapian and 9s in zim itself.

@mgautierfr
Copy link
Collaborator Author

Is it just a matter of starting an adb shell to the device and starting the binary ?

Yes, it should be as simple as it sound.

@benoit74
Copy link

benoit74 commented Mar 2, 2025

Yes, it should be as simple as it sound.

Please send me the binary and the ZIM file then, I will give it a try (mostly for curiosity, since numbers on Pi3 seems to speak for themselves)

@mgautierfr
Copy link
Collaborator Author

Please send me the binary

I cannot because of this : kiwix/kiwix-build#808

@kelson42
Copy link
Contributor

kelson42 commented Mar 3, 2025

@veloman-yunkan Can you please do the necessary to fix kiwix-build?

@veloman-yunkan
Copy link
Collaborator

@veloman-yunkan Can you please do the necessary to fix kiwix-build?

@kelson42 I will

@kelson42
Copy link
Contributor

kelson42 commented Mar 4, 2025

@mgautierfr Does the documentation (readthedocs) has been changed? It is not very clear to me right now how to use this feature and what has changed in term of interface for the developer.

@mgautierfr
Copy link
Collaborator Author

mgautierfr commented Mar 4, 2025

Does the documentation (readthedocs) has been changed?

It will be once merged. readthedocs documentation is generated from comments in source code. You can see the rtd for the pr at : https://libzim--958.org.readthedocs.build/en/958/
Once merged, it will become the "main" doc.

Saying that, I've just realize that the doc is generated as if libzim was not compiled with xapian. So some methods are not present (especially Archive::preloadXapianDb here). But it is an issue (#959) to handle separately from this PR.

It is not very clear to me right now how to use this feature and what has changed in term of interface for the developer.

User have to call Archive::preloadXapianDb to ... preload the xapian db.
Other change is that now search is thread safe so user don't have to protect search related calls with its own mutex. It is a relax on a usage constraints but nothing changed on the API itself.

@kelson42
Copy link
Contributor

kelson42 commented Mar 4, 2025

@mgautierfr From what I see there, the preloading of the Xapian indexes is not and option to specify at ZIM archive.

@mgautierfr
Copy link
Collaborator Author

Yes, because of this :

Saying that, I've just realize that the doc is generated as if libzim was not compiled with xapian. So some methods are not present (especially Archive::preloadXapianDb here). But it is an issue (#959) to handle separately from this PR.

@veloman-yunkan
Copy link
Collaborator

@veloman-yunkan Can you please do the necessary to fix kiwix-build?

@kelson42 I will

Done

@kelson42
Copy link
Contributor

kelson42 commented Mar 5, 2025

@mgautierfr preloading xapian index has to be an option of the opening primitives, not a dedicated public API.

Now m_direntLookup is created at fileimpl creation, not at first (user)
access to the dirents.

This is kind of useless to delay the dirent_lookup creation until the
first access as user will need to create it to use the archive.

For the rare case when this is not needed (search only, iter efficient),
user can now simply deactivate the dirent lookup.
On Windows, static member are not exported in dlls.
So we don't have a value to use as default for preloadConfig.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Speed-up Xapian searches by preloading indexes
4 participants