Harmonising wildcard processing #209
Comments
@siddharth-krishna @SamRWest what do you think?
The only potential issue I can see with separating e.g. An alternative idea might be to move the helper functions inside
Memory can definitely be an issue. Could we, e.g., store resolved wildcards as a list / string in the original table until they are used, or would that not help? If we choose to expand them, we could also reduce the size by dropping the records that would otherwise be overwritten.
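As a rough illustration of the second idea (dropping records that would be overwritten anyway), here is a minimal sketch in plain Python. It assumes that a later record overrides an earlier one with the same key; the `(process, attribute, value)` record layout and the helper name `drop_overwritten` are hypothetical and not taken from the codebase.

```python
def drop_overwritten(records):
    """Keep only the last record for each (process, attribute) key.

    Assumption: a later record overrides an earlier one with the same key.
    """
    latest = {}
    for process, attribute, value in records:
        key = (process, attribute)
        # Remove any earlier entry so the re-inserted key moves to the end
        latest.pop(key, None)
        latest[key] = value
    return [(p, a, v) for (p, a), v in latest.items()]


# Hypothetical records, as they might look after wildcard expansion
expanded = [
    ("PRC_A", "COST", 1.0),
    ("PRC_B", "COST", 1.0),
    ("PRC_A", "COST", 2.5),  # overrides the first record
]
print(drop_overwritten(expanded))
```

Whether this saves much memory depends on how many expanded records share a key, so it would need measuring on real tables.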
The fastest approach I've found (and what I've implemented in The regex lookups are slow, and the DF row iteration is incredibly slow. This minimises the regex lookups and mostly avoids the iteration, making it much faster. There are probably some additional gains to be had by building this lookup table of unique wildcard matches once for wildcards in all tables, as this should eliminate additional duplicates. Yes, it'll use more RAM to store, but it'll be far less than the end-result joined/exploded tables, so I doubt it'd be a bottleneck. I've got this working for ~TFM_UPD over here but got sidetracked on some other stuff and need to get back to it. I've opened PR #210 for you guys to have a look at. Happy to push on if you think this is the right approach? The remaining performance issue is that it still has to iterate over
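The lookup-table idea described above can be sketched roughly as follows. This is not the code from PR #210, just a plain-Python illustration under stated assumptions: wildcards are taken to be shell-style (`*`, `?`), and the helper names `build_wildcard_lookup` and `expand_rows` as well as the sample process names are hypothetical. Each unique pattern is compiled and matched once, and the rows are then expanded from the cached matches instead of running a regex per row.

```python
import fnmatch
import re


def build_wildcard_lookup(patterns, names):
    """Resolve each unique wildcard pattern against `names` exactly once."""
    lookup = {}
    for pattern in set(patterns):
        # fnmatch.translate converts a shell-style wildcard into a regex string
        regex = re.compile(fnmatch.translate(pattern))
        lookup[pattern] = [n for n in names if regex.match(n)]
    return lookup


def expand_rows(rows, lookup):
    """Replace each (pattern, value) row with one row per matched name."""
    return [(name, value) for pattern, value in rows for name in lookup[pattern]]


# Hypothetical data: three rows, but only two unique patterns to resolve
processes = ["ELCCOA01", "ELCGAS01", "HETGAS01"]
rows = [("ELC*", 10), ("*GAS01", 20), ("ELC*", 30)]

lookup = build_wildcard_lookup((p for p, _ in rows), processes)
print(expand_rows(rows, lookup))
```

With DataFrames, the second step would correspond to joining the table against the lookup and exploding the list of matches, which is where the joined/exploded tables mentioned above come from.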
Currently there are 2 transforms dealing with wildcard processing:
- `process_uc_wildcards` deals with wildcards in `uc_t` tables;
- `process_wildcards` deals with wildcards in `tfm` tables and changes the dataframe with attribute data based on their content.

Would it be practical to:
- `tfm` tables out of `process_wildcards`;
- `process_uc_wildcards` and `process_wildcards` in a single transform?

I believe these changes would also make it easier to address #153 and #154.