Various fixes #274

Merged: 30 commits merged into main from olex/various-fixes on Jan 10, 2025
Conversation

olejandro (Member) commented on Jan 6, 2025:
This PR makes various fixes, including:

  • exclude i/e options from arithmetic operations
  • move dummy import switch to Config and remove default cost
  • populate default values for limtype and timeslice in uc_t
  • capitalise some attribute names upon transforming tfm table variants
  • modify mappings for PRC_REFIT, FLO_FUNC, NCAP_PKCNT, etc.
  • fix type errors
  • improve pattern processing and close #275 (Treating '_' as '?' when processing wildcards)
  • simplify the logic behind wildcard processing

olejandro marked this pull request as ready for review on January 8, 2025 at 06:11
olejandro (Member Author) commented:

@siddharth-krishna any chance you could take a look at why the unit test is failing? It seems to be working fine locally...

@@ -220,14 +220,14 @@ def compare(
        f" {sum(df.shape[0] for _, df in ground_truth.items())} rows"
    )

-    missing = set(ground_truth.keys()) - set(data.keys())
+    missing = set(ground_truth.keys()).difference(data.keys())
siddharth-krishna (Collaborator) commented:

Curious as to the reason for this change?

olejandro (Member Author) replied:

Was just to have the same approach to this everywhere. :-) I guess there is no difference with respect to speed?

siddharth-krishna (Collaborator) replied:

> Also, on a big model, somehow I got an impression that include.difference(exclude) was noticeably faster than include - exclude. Could that be?

Interesting! It might be because you actually replace set(include) - set(exclude) with set(include).difference(exclude), and the latter simply iterates through exclude while the former creates a set out of it, perhaps redundantly? Happy with your changes, was just curious :)
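A minimal, self-contained illustration of the point above (the set contents are made up): the - operator needs both operands to be sets, while .difference() accepts any iterable directly, so no intermediate set has to be built.

    include = {"COAL", "GAS", "OIL"}
    exclude = ["GAS"]  # any iterable, not necessarily a set

    # `-` only works between two sets, so the right-hand side must be converted first;
    # `.difference()` simply iterates over its argument.
    a = include - set(exclude)
    b = include.difference(exclude)
    assert a == b == {"COAL", "OIL"}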

    sets = dict()
    for action, regex_maker in map.items():
        sets[action] = set(
            df.filter(regex=regex_maker(pattern), axis="index").iloc[:, 0].to_list()
siddharth-krishna (Collaborator) suggested a change:

-            df.filter(regex=regex_maker(pattern), axis="index").iloc[:, 0].to_list()
+            df.filter(regex=regex_maker(pattern), axis="index").iloc[:, 0]

    if acc is None:
        return df
    return acc.merge(df)
    return sets["include"].difference(sets["exclude"])
siddharth-krishna (Collaborator) commented:

I know it's a nice trick to use a dict and a for-loop to replace repeated code, but I'm not sure it improves readability when the loop only executes twice? If there's a chance of adding more iterations to the loop in the future, the new version makes sense to me, otherwise I wonder if something like this is more understandable:

    include = set(df.filter(regex=utils.create_regexp(pattern), axis="index").iloc[:, 0])
    exclude = set(df.filter(regex=utils.create_negative_regexp(pattern), axis="index").iloc[:, 0])
    return include - exclude

olejandro (Member Author) replied:

I actually had it exactly that way, before switching to the current version. :-)
Yes, we may need to refactor it a few more times in the future, because the tables may actually include info on how to combine the filters. But that's for the future, since we don't have test cases for it anyway!

olejandro (Member Author) commented:

Also, on a big model, somehow I got an impression that include.difference(exclude) was noticeably faster than include - exclude. Could that be?

    return df.drop(exclude)

def filter_by_pattern(df: DataFrame, pattern: str) -> set[str]:
    """Filter dataframe index by a regex pattern. Return a set of corresponding items."""
siddharth-krishna (Collaborator) commented:

I think this code also assumes that the items are in the first column?

olejandro (Member Author) replied:

Yes, there should be only one column... Will update the description!
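For context, a small self-contained sketch of the filtering idiom discussed above (the dataframe contents are made up): filtering with axis="index" matches a regex against the index labels, and .iloc[:, 0] then reads the surviving items from the first (and, per the comment above, only) column.

    import pandas as pd

    # One-column dataframe whose index holds the item names.
    df = pd.DataFrame(
        {"process": ["COAL_PP", "GAS_CC", "OIL_PP"]},
        index=["COAL_PP", "GAS_CC", "OIL_PP"],
    )

    # df.filter with axis="index" keeps rows whose index label matches the regex.
    matched = set(df.filter(regex=".*_PP", axis="index").iloc[:, 0])
    assert matched == {"COAL_PP", "OIL_PP"}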

    # Substitute VEDA wildcards with regex patterns; escape metacharacters.
    # ("_", ".") and ("[.]", "_") are meant to apply one after another to handle
    # the usage of "_" equivalent to "?" and "[_]" as literal "_".
    for substitution in ((".", "\\."), ("_", "."), ("[.]", "_"), ("*", ".*"), ("?", ".")):
siddharth-krishna (Collaborator) commented on Jan 10, 2025:

Sorry, I had #275 on my list but I didn't get around to it yet. Thanks for the fix, it's a smart one!

Btw, perhaps this is a good function to start writing unit tests for, as it's gotten reasonably complex/subtle?

Also, you can simplify this with for old, new in ... :)
Also, I would use a list of tuples over a tuple of tuples when it gets beyond 3 items, because tuples have some strange hardcoded implementation in Python if I remember correctly. :)

siddharth-krishna (Collaborator) added:

Ah I see we have #222 already, I'll put it on my list :)

olejandro (Member Author) replied:

Thanks! I'll change it to a list before merging.
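Picking up the unit-test suggestion above, here is a minimal sketch of what such a test could look like. The helper name _veda_pattern_to_regex is hypothetical; it simply mirrors the substitution loop shown in the diff, and the expected values follow from applying those substitutions in order.

    def _veda_pattern_to_regex(pattern: str) -> str:
        # Escape literal dots, map VEDA "_" to regex ".", restore "[_]"
        # (now "[.]") to a literal "_", then expand "*" and "?" wildcards.
        for old, new in [(".", "\\."), ("_", "."), ("[.]", "_"), ("*", ".*"), ("?", ".")]:
            pattern = pattern.replace(old, new)
        return pattern

    def test_veda_pattern_to_regex():
        assert _veda_pattern_to_regex("COAL*") == "COAL.*"        # "*" matches any sequence
        assert _veda_pattern_to_regex("ELC_") == "ELC."           # "_" behaves like "?"
        assert _veda_pattern_to_regex("GAS[_]CC") == "GAS_CC"     # "[_]" is a literal "_"
        assert _veda_pattern_to_regex("OIL.DST?") == "OIL\\.DST."  # "." escaped, "?" matches one char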

olejandro merged commit c268df8 into main on Jan 10, 2025
2 checks passed
olejandro deleted the olex/various-fixes branch on January 10, 2025 at 12:22
Labels: none yet
Projects: none yet
Development: successfully merging this pull request may close the issue "Treating '_' as '?' when processing wildcards" (#275)
2 participants: olejandro, siddharth-krishna