Speed up the loading of large tables #2026

Merged
merged 1 commit into from
Feb 3, 2025


@visr visr commented Jan 31, 2025

This fixes a performance issue that @rbruijnshkv encountered when initializing a model with a `Basin / time` table of 6 million rows, spread over 1000 Basin nodes. Initialization spent around 1-2 seconds per Basin node on this line. `time` is a `StructVector`, which stores each column as a separate vector. By broadcasting `getfield` over it, we iterated over the rows, constructing a `BasinTime` struct for each row and then taking a single field from it. That works, but it is much slower than directly taking the column, which is already a vector.
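To illustrate the difference, here is a minimal sketch using StructArrays.jl. The `Row` struct and its `node_id` and `level` fields are hypothetical stand-ins rather than the actual `BasinTime` schema, and this is not the code changed in this PR; it only contrasts the two access patterns described above.

```julia
using StructArrays  # third-party package that provides StructArray / StructVector

# Hypothetical row type for illustration only; not Ribasim's BasinTime.
struct Row
    node_id::Int32
    level::Float64
end

n = 1_000
# Each field is stored as its own column vector inside the StructVector.
table = StructArray{Row}((Int32.(1:n), rand(n)))

# Slow pattern: broadcasting builds a Row for every element, then reads one field.
slow = getfield.(table, :level)

# Fast pattern: property access returns the column that is already stored as a vector.
fast = table.level

@assert slow == fast
```

Both patterns yield the same values, but the second avoids constructing a struct per row, which is where the time goes when a table has millions of rows.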

The general recommendation for such large tables is to store them in a separate Arrow file rather than in the model database, as done here: https://github.com/Deltares/Ribasim/blob/v2025.1.0/python/ribasim_testmodels/ribasim_testmodels/basic.py#L210. Doing this shrank the database from 400 to 100 MB, and also sped up initialization. The fix in this PR should help with both formats, though.


@evetion evetion left a comment

I like these kinds of improvements 🎉

@visr visr merged commit c75888a into main Feb 3, 2025
25 checks passed
@visr visr deleted the overly-broad-cast branch February 3, 2025 09:23