-
Notifications
You must be signed in to change notification settings - Fork 8
Repo-level and pattern-level configuration in .thetaconfig #214
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
||
|
||
def get_checkpoint_handler(checkpoint_type: Optional[str] = None) -> Checkpoint: | ||
"""Get the checkpoint handler either by name or from an environment variable. | ||
def get_checkpoint_handler(checkpoint_path: str) -> Checkpoint: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this function is doing too much now, based on SoC, this function should really be about translating a checkpoint name into a class. We should probably have the path -> checkpoint type be done externally and then the result is passed in.
return checkpoint_type or utils.EnvVarConstants.CHECKPOINT_TYPE | ||
repo = git_utils.get_git_repo() | ||
checkpoint_path = git_utils.get_relative_path_from_root(repo, checkpoint_path) | ||
thetaconfig = config.ThetaConfigFile(repo) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is going to result in reading the config file a bunch of time yeah? Should we refactor to something like the config file is read once and then the config object is used?
# TODO we need a better way of keeping track of configuration at the repository level | ||
# For LSH configuration, once it is set for a repository, changing it should be handled with care | ||
repo = git_utils.get_git_repo() | ||
thetaconfig = config.ThetaConfigFile(repo) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we want to make sure that we decouple a lot of these functions from needing there to be an actual config, otherwise they get really hard to reuse or use in tests.
We probably want things like the signature size, atol, etc to be parameters where None values result in looking it up in the config?
# N.b. we use a fixed seed so that every instance of RandomPool has the same set of random numbers | ||
rng = Generator(MT19937(seed=42)) | ||
pool = rng.normal(size=EnvVarConstants.LSH_POOL_SIZE) | ||
pool = rng.normal(size=thetaconfig.repo_config.lsh_pool_size) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pool size should probably be a parameter where a None
results in a config look up, the same way signature size is
@@ -26,7 +26,8 @@ class Update(metaclass=ABCMeta): | |||
|
|||
name: str = NotImplemented # The name used to lookup the plug-in. | |||
|
|||
def __init__(self, serializer: params.Serializer, *args, **kwargs): | |||
def __init__(self, path: str, serializer: params.Serializer, *args, **kwargs): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The base class never needs the path right? Can we move this to IncrementalUpdate init like the update_data
parameter?
class GitAttributesFile: | ||
def __init__(self, repo: git.Repo): | ||
self.repo = repo | ||
self.file = os.path.join(repo.working_dir, ".gitattributes") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same question as below wrt to holding onto the file name.
pattern_attrs.add_attribute("filter=theta") | ||
pattern_attrs.add_attribute("merge=theta") | ||
pattern_attrs.add_attribute("diff=theta") | ||
pattern_found = True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add a comment about how we don't have a break
here because gitattribute shadow behavior means we want to update every line where the pattern appears?
def __init__(self, serializer: params.Serializer, update_data: str = ""): | ||
super().__init__(serializer) | ||
def __init__(self, path: str, serializer: params.Serializer, update_data: str = ""): | ||
super().__init__(path, serializer) | ||
self.update_information: Dict[str, np.ndarray] = None | ||
self.update_names: utils.Trie = None | ||
# Flatten the side-loaded information into a of string keys to arrays. | ||
if update_data: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to move the reading of sideloaded information out of the IncrementalUpdate class and have it expect an Optional[Dict[str, np.ndarray]]
instead? That would remove the need to pass path
into the updates everywhere right? Updaters created during the recursive part get their parameter data from git so they don't need to get passed update_data
@@ -71,15 +71,8 @@ def __get__(self, obj, objtype=None): | |||
|
|||
|
|||
class EnvVarConstants: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should go full one way or the other for configuration? Right now somethings are in the file and some things are env variables. My original though was that the config file would just populate env-vars with default values so all lookups are via envvars.
Do we want some new collection of values on the config object that are like runtime config values? That is where things like update type or update data path would live? They could have defaults in the config but get overwritten by EnvVars, kinda like the jax config object?
self.repo_config, self.pattern_configs = ThetaConfigFile.read(self.file) | ||
|
||
@classmethod | ||
def read(cls, file) -> Tuple[RepoConfig, List[PatternConfig]]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like with this approach we are loosing the ability to tweak values from and EnvVar at runtime? We definitely want to be able to do that for things like max concurrency and I could see wanting to do that with other things too,
This change is based on a part of r-three#214 where the test for if a file is tracked by Git-Theta is abstracted into a function. However it is implemented differently as it now handles the subtleties of what attribute line is active at a given time. Only the final line that a path matches is used to set attributes and only the last entry for some key is used. This is now respected.
Currently, when a pattern in `.gitattributes` matches the path in `git-theta track`, the theta attributes (filter, merge, diff) are added to that line. This can cause issues as it may result in tracking more files with git-theta than expected. However, just adding a new line that is an exact match for the path to the end of the gitattribute file is not correct either. The *last* attribute line in the file is the one used by Git. This could result in `git-theta track` removing an some other attribute that is set for that file. This PR updates the way that git attributes are set. If there are already git attributes set to non-theta values for the file that are used by git-theta (filter, merge, diff) an error is raised. If these attributes are all set to git-theta, no new entry is added, even when the entry is a pattern match instead of an exact match. If the new file has no attributes set, or attributes that don't overlap with git-theta, then a new entry is added. Non-overlapping attributes are copied down if they were set before. When a line has an attribute set multiple times, the *last* one is used so we can add the theta filters at the end to override any previous ones. Added a `is_theta_tracked` function similar to the one from r-three#214 where the test for if a file is tracked by Git-Theta is abstracted into a function. However it is implemented differently as it now handles the subtleties of what attribute line is active at a given time. Only the final line that a path matches is used to set attributes and only the last entry for some key is used. This is now respected.
Currently, when a pattern in `.gitattributes` matches the path in `git-theta track`, the theta attributes (filter, merge, diff) are added to that line. This can cause issues as it may result in tracking more files with git-theta than expected. However, just adding a new line that is an exact match for the path to the end of the gitattribute file is not correct either. The *last* attribute line in the file is the one used by Git. This could result in `git-theta track` removing an some other attribute that is set for that file. This PR updates the way that git attributes are set. If there are already git attributes set to non-theta values for the file that are used by git-theta (filter, merge, diff) an error is raised. If these attributes are all set to git-theta, no new entry is added, even when the entry is a pattern match instead of an exact match. If the new file has no attributes set, or attributes that don't overlap with git-theta, then a new entry is added. Non-overlapping attributes are copied down if they were set before. When a line has an attribute set multiple times, the *last* one is used so we can add the theta filters at the end to override any previous ones. Added a `is_theta_tracked` function similar to the one from r-three#214 where the test for if a file is tracked by Git-Theta is abstracted into a function. However it is implemented differently as it now handles the subtleties of what attribute line is active at a given time. Only the final line that a path matches is used to set attributes and only the last entry for some key is used. This is now respected. Git attributes support the "unsetting" of attributes (e.g. "-diff"), these attributes are correctly copied around, but they aren't currently logically checked to see if they intersect with git-theta attributes.
* Update how `git-theta track ${path}` works. Currently, when a pattern in `.gitattributes` matches the path in `git-theta track`, the theta attributes (filter, merge, diff) are added to that line. This can cause issues as it may result in tracking more files with git-theta than expected. However, just adding a new line that is an exact match for the path to the end of the gitattribute file is not correct either. The *last* attribute line in the file is the one used by Git. This could result in `git-theta track` removing an some other attribute that is set for that file. This PR updates the way that git attributes are set. If there are already git attributes set to non-theta values for the file that are used by git-theta (filter, merge, diff) an error is raised. If these attributes are all set to git-theta, no new entry is added, even when the entry is a pattern match instead of an exact match. If the new file has no attributes set, or attributes that don't overlap with git-theta, then a new entry is added. Non-overlapping attributes are copied down if they were set before. When a line has an attribute set multiple times, the *last* one is used so we can add the theta filters at the end to override any previous ones. Added a `is_theta_tracked` function similar to the one from #214 where the test for if a file is tracked by Git-Theta is abstracted into a function. However it is implemented differently as it now handles the subtleties of what attribute line is active at a given time. Only the final line that a path matches is used to set attributes and only the last entry for some key is used. This is now respected. Git attributes support the "unsetting" of attributes (e.g. "-diff"), these attributes are correctly copied around, but they aren't currently logically checked to see if they intersect with git-theta attributes. * bump setup python version
This PR factors a bunch of the environment variable-based configuration out into a
.thetaconfig
file at the root of the repository. This file gets committed and distributed as part of the repo..thetaconfig
is a json file with two top-level keys:repo
andpatterns
.The
repo
key maps to a dictionary of repository-level config fields like the atol and rtol parameters for equality checks, the signature size, pool size and threshold for LSH, and the maximum number of concurrent async jobs.The
patterns
key maps to a list of dictionaries. Each dictionary contains the configuration for a particular glob pattern in the same way that gitattributes are specified for glob patterns. Right now, the only pattern-level configuration is the checkpoint format (PyTorch, TF, etc.).The
.thetaconfig
file is created whenever a user runsgit theta track
for the first time. Each subsequentgit theta track
command adds another pattern-level dictionary to the file.Here is an example of what
.thetaconfig
looks like:In addition to the code change for creating the
.thetaconfig
file, I also factored the.gitattributes
reading/writing/parsing out into analogous classes.Note 1:
In cases where we want to run a
git
command on a file that matches multiple patterns in.thetaconfig
, the behavior is that patterns that appear later in.thetaconfig
shadow previous patterns. This matches the behavior of.gitattributes
.For instance say we have the following
.thetaconfig
:If we run a git command on
dir1/dir2/model.bin
, this would match both patterns, but the second pattern would shadow the first pattern. Thus, the file would be treated as a PyTorch checkpoint. It seems a bit weird since the first pattern exactly matches the file, but this behavior matches Git's parsing behavior for the.gitattributes
file. Since both.gitattributes
and.thetaconfig
are written using thegit theta track
command, we should ensure that the parsing logic for these files is the same.Note 2:
The way we were handling adding gitattributes in the past was slightly wrong. For instance, if we had attributes for the pattern
dir/*.pt
and we rungit theta track dir/model.pt
, then the code would add the theta filter to thedir/*.pt
entry. This is incorrect behavior since we end up tracking more files than what we intend to. This is fixed in the PR.