Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BC] ExplorationWithPolicy #378

Merged
merged 34 commits into from
Oct 8, 2024
Merged

[BC] ExplorationWithPolicy #378

merged 34 commits into from
Oct 8, 2024

Conversation

tvmarino
Copy link
Collaborator

@tvmarino tvmarino commented Oct 3, 2024

ExplorationWithPolicy defines an exploration strategy which specifies which states in a compilation trajectory are to be explored. Does not define how the exploration is carried out.

tvmarino and others added 27 commits September 3, 2024 20:16
given by combine_tfa_policies_lib.get_input_signature()
and action spec given by combine_tfa_policies_lib.get_action_spec()
The combiner policy uses a new timestep spec feature "model_selector"
to select the requested policy at the current state. The feature is
computed as a md5 hash from the respective policies names.
TimeStep. The patch also gives the option to keep the temporary
working_dir by setting keep_temps in compilation_runner.py to a
directory where all temporary working_dirs will be saved.
compiles only using iclang instead of both clang
and iclang.
Changed the unit test for MLGOEnvironment to check if
the clang sessions are equal and respect the interactive_only
variable.
class ExplorationWithPolicy:
"""Policy which selects states for exploration.

Exploration is fascilitated in the following way. First the policy plays
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

facilitated (spelling)

Copy link
Collaborator

@mtrofin mtrofin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mostly comments about contracts with the user.

self._stop_exploration = True
break
self.curr_step += 1
return policy_deca
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm missing something - how is policy_deca changed from line 79?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does not change. get_advice is supposed to return the original advice from the policy and what step we are supposed to explore on. That is, where to explore is handled here and how to explore is going to be handled in a different class.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, so but why would the user, who passed the policy in, care to pick it up again from the return here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ExplorationWithPolicy has the following behavior: play actions from the replay_prefix until it is exhausted. Play according to policy after it is exhausted and simultaneously check the conditions for exploration which use explore_policy. If the condition is satisfied at a given step of the trajectory compilation, set explore_step to that step to be used by the exploration logic to take a new decision there. The whole exploration process works in the following way: 1. compile the module with policy and save the trajectory, 2. from 1. we know what step to explore on so construct a new replay prefix which includes all actions before exploration step as played by policy and includes the action selected by the exploration procedure at the exploration step selected in 1. 3. play the replay_prefix from 2. until after the exploration_step, then follow policy until the end. We can make this clear in the current docstring or in the docstring where the exploration logic is contained or maybe more appropriately in the docstring for generate_bc_trajectories.py in the beginning.

@tvmarino tvmarino merged commit d687b9b into google:main Oct 8, 2024
15 checks passed
@tvmarino tvmarino deleted the behavior_cloning branch October 10, 2024 16:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants