Skip to content

Repo-level and pattern-level configuration in .thetaconfig #214

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

nkandpa2
Copy link
Collaborator

@nkandpa2 nkandpa2 commented May 4, 2023

This PR factors a bunch of the environment variable-based configuration out into a .thetaconfig file at the root of the repository. This file gets committed and distributed as part of the repo.

.thetaconfig is a json file with two top-level keys: repo and patterns.

The repo key maps to a dictionary of repository-level config fields like the atol and rtol parameters for equality checks, the signature size, pool size and threshold for LSH, and the maximum number of concurrent async jobs.

The patterns key maps to a list of dictionaries. Each dictionary contains the configuration for a particular glob pattern in the same way that gitattributes are specified for glob patterns. Right now, the only pattern-level configuration is the checkpoint format (PyTorch, TF, etc.).

The .thetaconfig file is created whenever a user runs git theta track for the first time. Each subsequent git theta track command adds another pattern-level dictionary to the file.

Here is an example of what .thetaconfig looks like:

{
    "repo": {
        "parameter_atol": 1e-08,
        "parameter_rtol": 1e-05,
        "lsh_signature_size": 16,
        "lsh_threshold": 1e-06,
        "lsh_pool_size": 10000,
        "max_concurrency": -1
    },
    "patterns": [
        {
            "pattern": "model.pt",
            "checkpoint_format": "pytorch"
        }
    ]
}

In addition to the code change for creating the .thetaconfig file, I also factored the .gitattributes reading/writing/parsing out into analogous classes.

Note 1:
In cases where we want to run a git command on a file that matches multiple patterns in .thetaconfig, the behavior is that patterns that appear later in .thetaconfig shadow previous patterns. This matches the behavior of .gitattributes.

For instance say we have the following .thetaconfig:

{
    "repo": {
        "parameter_atol": 1e-08,
        "parameter_rtol": 1e-05,
        "lsh_signature_size": 16,
        "lsh_threshold": 1e-06,
        "lsh_pool_size": 10000,
        "max_concurrency": -1
    },
    "patterns": [
        {
            "pattern": "dir1/dir2/model.bin",
            "checkpoint_format": "tf"
        },
        {
            "pattern": "dir1/dir2/**",
            "checkpoint_format": "pytorch"
        }
    ]
}

If we run a git command on dir1/dir2/model.bin, this would match both patterns, but the second pattern would shadow the first pattern. Thus, the file would be treated as a PyTorch checkpoint. It seems a bit weird since the first pattern exactly matches the file, but this behavior matches Git's parsing behavior for the .gitattributes file. Since both .gitattributes and .thetaconfig are written using the git theta track command, we should ensure that the parsing logic for these files is the same.

Note 2:
The way we were handling adding gitattributes in the past was slightly wrong. For instance, if we had attributes for the pattern dir/*.pt and we run git theta track dir/model.pt, then the code would add the theta filter to the dir/*.pt entry. This is incorrect behavior since we end up tracking more files than what we intend to. This is fixed in the PR.



def get_checkpoint_handler(checkpoint_type: Optional[str] = None) -> Checkpoint:
"""Get the checkpoint handler either by name or from an environment variable.
def get_checkpoint_handler(checkpoint_path: str) -> Checkpoint:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this function is doing too much now, based on SoC, this function should really be about translating a checkpoint name into a class. We should probably have the path -> checkpoint type be done externally and then the result is passed in.

return checkpoint_type or utils.EnvVarConstants.CHECKPOINT_TYPE
repo = git_utils.get_git_repo()
checkpoint_path = git_utils.get_relative_path_from_root(repo, checkpoint_path)
thetaconfig = config.ThetaConfigFile(repo)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is going to result in reading the config file a bunch of time yeah? Should we refactor to something like the config file is read once and then the config object is used?

# TODO we need a better way of keeping track of configuration at the repository level
# For LSH configuration, once it is set for a repository, changing it should be handled with care
repo = git_utils.get_git_repo()
thetaconfig = config.ThetaConfigFile(repo)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we want to make sure that we decouple a lot of these functions from needing there to be an actual config, otherwise they get really hard to reuse or use in tests.

We probably want things like the signature size, atol, etc to be parameters where None values result in looking it up in the config?

# N.b. we use a fixed seed so that every instance of RandomPool has the same set of random numbers
rng = Generator(MT19937(seed=42))
pool = rng.normal(size=EnvVarConstants.LSH_POOL_SIZE)
pool = rng.normal(size=thetaconfig.repo_config.lsh_pool_size)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pool size should probably be a parameter where a None results in a config look up, the same way signature size is

@@ -26,7 +26,8 @@ class Update(metaclass=ABCMeta):

name: str = NotImplemented # The name used to lookup the plug-in.

def __init__(self, serializer: params.Serializer, *args, **kwargs):
def __init__(self, path: str, serializer: params.Serializer, *args, **kwargs):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The base class never needs the path right? Can we move this to IncrementalUpdate init like the update_data parameter?

class GitAttributesFile:
def __init__(self, repo: git.Repo):
self.repo = repo
self.file = os.path.join(repo.working_dir, ".gitattributes")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same question as below wrt to holding onto the file name.

pattern_attrs.add_attribute("filter=theta")
pattern_attrs.add_attribute("merge=theta")
pattern_attrs.add_attribute("diff=theta")
pattern_found = True
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a comment about how we don't have a break here because gitattribute shadow behavior means we want to update every line where the pattern appears?

def __init__(self, serializer: params.Serializer, update_data: str = ""):
super().__init__(serializer)
def __init__(self, path: str, serializer: params.Serializer, update_data: str = ""):
super().__init__(path, serializer)
self.update_information: Dict[str, np.ndarray] = None
self.update_names: utils.Trie = None
# Flatten the side-loaded information into a of string keys to arrays.
if update_data:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to move the reading of sideloaded information out of the IncrementalUpdate class and have it expect an Optional[Dict[str, np.ndarray]] instead? That would remove the need to pass path into the updates everywhere right? Updaters created during the recursive part get their parameter data from git so they don't need to get passed update_data

@@ -71,15 +71,8 @@ def __get__(self, obj, objtype=None):


class EnvVarConstants:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should go full one way or the other for configuration? Right now somethings are in the file and some things are env variables. My original though was that the config file would just populate env-vars with default values so all lookups are via envvars.

Do we want some new collection of values on the config object that are like runtime config values? That is where things like update type or update data path would live? They could have defaults in the config but get overwritten by EnvVars, kinda like the jax config object?

self.repo_config, self.pattern_configs = ThetaConfigFile.read(self.file)

@classmethod
def read(cls, file) -> Tuple[RepoConfig, List[PatternConfig]]:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like with this approach we are loosing the ability to tweak values from and EnvVar at runtime? We definitely want to be able to do that for things like max concurrency and I could see wanting to do that with other things too,

blester125 added a commit to blester125/git-theta that referenced this pull request Jun 26, 2023
This change is based on a part of
r-three#214 where the test for if a
file is tracked by Git-Theta is abstracted into a function.

However it is implemented differently as it now handles the subtleties of
what attribute line is active at a given time.

Only the final line that a path matches is used to set attributes and
only the last entry for some key is used. This is now respected.
blester125 added a commit to blester125/git-theta that referenced this pull request Oct 18, 2023
Currently, when a pattern in `.gitattributes` matches the path in
`git-theta track`, the theta attributes (filter, merge, diff) are added
to that line. This can cause issues as it may result in tracking more
files with git-theta than expected.

However, just adding a new line that is an exact match for the path to
the end of the gitattribute file is not correct either. The *last*
attribute line in the file is the one used by Git. This could result in
`git-theta track` removing an some other attribute that is set for that
file.

This PR updates the way that git attributes are set. If there are
already git attributes set to non-theta values for the file that are used by git-theta
(filter, merge, diff) an error is raised. If these attributes are all
set to git-theta, no new entry is added, even when the entry is a
pattern match instead of an exact match. If the new file has no
attributes set, or attributes that don't overlap with git-theta, then a
new entry is added. Non-overlapping attributes are copied down if they
were set before.

When a line has an attribute set multiple times, the *last* one is used
so we can add the theta filters at the end to override any previous
ones.

Added a `is_theta_tracked` function similar to the one from
r-three#214 where the test for if a
file is tracked by Git-Theta is abstracted into a function.

However it is implemented differently as it now handles the subtleties of
what attribute line is active at a given time.

Only the final line that a path matches is used to set attributes and
only the last entry for some key is used. This is now respected.
blester125 added a commit to blester125/git-theta that referenced this pull request Oct 19, 2023
Currently, when a pattern in `.gitattributes` matches the path in
`git-theta track`, the theta attributes (filter, merge, diff) are added
to that line. This can cause issues as it may result in tracking more
files with git-theta than expected.

However, just adding a new line that is an exact match for the path to
the end of the gitattribute file is not correct either. The *last*
attribute line in the file is the one used by Git. This could result in
`git-theta track` removing an some other attribute that is set for that
file.

This PR updates the way that git attributes are set. If there are
already git attributes set to non-theta values for the file that are used by git-theta
(filter, merge, diff) an error is raised. If these attributes are all
set to git-theta, no new entry is added, even when the entry is a
pattern match instead of an exact match. If the new file has no
attributes set, or attributes that don't overlap with git-theta, then a
new entry is added. Non-overlapping attributes are copied down if they
were set before.

When a line has an attribute set multiple times, the *last* one is used
so we can add the theta filters at the end to override any previous
ones.

Added a `is_theta_tracked` function similar to the one from
r-three#214 where the test for if a
file is tracked by Git-Theta is abstracted into a function.

However it is implemented differently as it now handles the subtleties of
what attribute line is active at a given time.

Only the final line that a path matches is used to set attributes and
only the last entry for some key is used. This is now respected.

Git attributes support the "unsetting" of attributes (e.g. "-diff"),
these attributes are correctly copied around, but they aren't currently logically
checked to see if they intersect with git-theta attributes.
blester125 added a commit that referenced this pull request Jan 24, 2024
* Update how `git-theta track ${path}` works.

Currently, when a pattern in `.gitattributes` matches the path in
`git-theta track`, the theta attributes (filter, merge, diff) are added
to that line. This can cause issues as it may result in tracking more
files with git-theta than expected.

However, just adding a new line that is an exact match for the path to
the end of the gitattribute file is not correct either. The *last*
attribute line in the file is the one used by Git. This could result in
`git-theta track` removing an some other attribute that is set for that
file.

This PR updates the way that git attributes are set. If there are
already git attributes set to non-theta values for the file that are used by git-theta
(filter, merge, diff) an error is raised. If these attributes are all
set to git-theta, no new entry is added, even when the entry is a
pattern match instead of an exact match. If the new file has no
attributes set, or attributes that don't overlap with git-theta, then a
new entry is added. Non-overlapping attributes are copied down if they
were set before.

When a line has an attribute set multiple times, the *last* one is used
so we can add the theta filters at the end to override any previous
ones.

Added a `is_theta_tracked` function similar to the one from
#214 where the test for if a
file is tracked by Git-Theta is abstracted into a function.

However it is implemented differently as it now handles the subtleties of
what attribute line is active at a given time.

Only the final line that a path matches is used to set attributes and
only the last entry for some key is used. This is now respected.

Git attributes support the "unsetting" of attributes (e.g. "-diff"),
these attributes are correctly copied around, but they aren't currently logically
checked to see if they intersect with git-theta attributes.

* bump setup python version
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants