-
Notifications
You must be signed in to change notification settings - Fork 95
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BC] ExplorationWithPolicy
#378
Conversation
given by combine_tfa_policies_lib.get_input_signature() and action spec given by combine_tfa_policies_lib.get_action_spec() The combiner policy uses a new timestep spec feature "model_selector" to select the requested policy at the current state. The feature is computed as a md5 hash from the respective policies names.
…iler-opt into policy_combiner
TimeStep. The patch also gives the option to keep the temporary working_dir by setting keep_temps in compilation_runner.py to a directory where all temporary working_dirs will be saved.
compiles only using iclang instead of both clang and iclang.
Changed the unit test for MLGOEnvironment to check if the clang sessions are equal and respect the interactive_only variable.
…piler-opt into behavior_cloning
class ExplorationWithPolicy: | ||
"""Policy which selects states for exploration. | ||
|
||
Exploration is fascilitated in the following way. First the policy plays |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
facilitated (spelling)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mostly comments about contracts with the user.
self._stop_exploration = True | ||
break | ||
self.curr_step += 1 | ||
return policy_deca |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm missing something - how is policy_deca changed from line 79?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does not change. get_advice is supposed to return the original advice from the policy and what step we are supposed to explore on. That is, where to explore is handled here and how to explore is going to be handled in a different class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, so but why would the user, who passed the policy in, care to pick it up again from the return here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ExplorationWithPolicy has the following behavior: play actions from the replay_prefix until it is exhausted. Play according to policy after it is exhausted and simultaneously check the conditions for exploration which use explore_policy. If the condition is satisfied at a given step of the trajectory compilation, set explore_step to that step to be used by the exploration logic to take a new decision there. The whole exploration process works in the following way: 1. compile the module with policy and save the trajectory, 2. from 1. we know what step to explore on so construct a new replay prefix which includes all actions before exploration step as played by policy and includes the action selected by the exploration procedure at the exploration step selected in 1. 3. play the replay_prefix from 2. until after the exploration_step, then follow policy until the end. We can make this clear in the current docstring or in the docstring where the exploration logic is contained or maybe more appropriately in the docstring for generate_bc_trajectories.py in the beginning.
ExplorationWithPolicy
defines an exploration strategy which specifies which states in a compilation trajectory are to be explored. Does not define how the exploration is carried out.