HowtoRun.txt
1. Create traj data
python -m sandbox.mack.run_simple --env simple_tag
python -m sandbox.mack.run_est_reward --env simple_speaker_listener --logdir /irl/test --rewdir /atlas/u/lantaoyu/exps/airl/simple_speaker_listener/decentralized/s-200/l-0.1-b-1000-d-0.1-c-500-l2-0.1-iter-1-r-0.0/seed-1/ --irl_iteration 3000
###
@click.option('--logdir', type=click.STRING, default='/atlas/u/lantaoyu')
@click.option('--env', type=click.Choice(['simple', 'simple_speaker_listener',
'simple_crypto', 'simple_push',
'simple_tag', 'simple_spread',
'simple_adversary']))
@click.option('--lr', type=click.FLOAT, default=0.1)
@click.option('--seed', type=click.INT, default=1)
@click.option('--batch_size', type=click.INT, default=1000)
@click.option('--atlas', is_flag=True, flag_value=True)
###
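The checkpoint path used later in step 3 (e.g. .../simple_tag/l-0.1-b-1000/seed-1/checkpoint01000-100tra.pkl) appears to be built from these options. A minimal sketch of that naming convention, assuming the <env>/l-<lr>-b-<batch_size>/seed-<seed> layout seen in the paths in this file (the exact format strings in the repo may differ):
###
import os

def mack_logdir(logdir, env, lr, batch_size, seed):
    # Hypothetical reconstruction of the "<env>/l-<lr>-b-<batch_size>/seed-<seed>"
    # layout seen in the expert_path examples in this file.
    return os.path.join(logdir, env, f"l-{lr}-b-{batch_size}", f"seed-{seed}")

print(mack_logdir("/atlas/u/lantaoyu/exps/mack", "simple_tag", 0.1, 1000, 1))
# -> /atlas/u/lantaoyu/exps/mack/simple_tag/l-0.1-b-1000/seed-1
###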
2. Create pkl file
Write the log file name to the path in irl/render.py
python -m irl.render --env simple_tag
###
@click.option('--env', type=click.STRING)
@click.option('--image', is_flag=True, flag_value=True)
###
3. Run MA-AIRL training
python -m irl.mack.run_mack_airl --env simple_tag --expert_path /atlas/u/lantaoyu/exps/mack/simple_tag/l-0.1-b-1000/seed-1/checkpoint01000-100tra.pkl
###
@click.option('--logdir', type=click.STRING, default='/atlas/u/lantaoyu/exps')
@click.option('--env', type=click.STRING, default='simple_spread')
@click.option('--expert_path', type=click.STRING,
default='/atlas/u/lantaoyu/projects/MA-AIRL/mack/simple_spread/l-0.1-b-1000/seed-1/checkpoint20000-1000tra.pkl')
@click.option('--seed', type=click.INT, default=1)
@click.option('--traj_limitation', type=click.INT, default=200)
@click.option('--ret_threshold', type=click.FLOAT, default=-10)
@click.option('--dis_lr', type=click.FLOAT, default=0.1)
@click.option('--disc_type', type=click.Choice(['decentralized',
'decentralized-all']),
default='decentralized')
@click.option('--bc_iters', type=click.INT, default=500)
@click.option('--l2', type=click.FLOAT, default=0.1)
@click.option('--d_iters', type=click.INT, default=1)
@click.option('--rew_scale', type=click.FLOAT, default=0)
###
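A minimal sketch of how --expert_path, --traj_limitation and --ret_threshold could interact, assuming the expert pkl is a list of per-trajectory dicts with an 'ep_ret' key as described in the simple_tag section below; the real filtering lives in irl/dataset.py and may differ:
###
import pickle
import numpy as np

# Hypothetical filtering sketch; file name taken from the example command above.
with open("checkpoint01000-100tra.pkl", "rb") as f:
    traj_data = pickle.load(f)

ret_threshold = -10     # --ret_threshold
traj_limitation = 200   # --traj_limitation

# Keep trajectories whose episode return clears the threshold, then cap the count.
kept = [t for t in traj_data if np.sum(t["ep_ret"]) > ret_threshold]
kept = kept[:traj_limitation]
print(len(traj_data), "->", len(kept), "trajectories")
###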
4. Replay
Write the log file name to the path in irl/render.py
python -m irl.render --env simple_tag --image
5. Confirm reward function
python create_graphs.py --env simple_tag
6. Confirm results
0 and 1 in the keys are agent indices
reward is the estimated reward
{"nupdates": 30238, "total_timesteps": 30238000, "fps": 1170, "explained_variance 0": -11.982209205627441, "policy_entropy 0": 0.02122282050549984, "policy_loss 0": 0.0007167136645875871, "value_loss 0": 0.0003796979144681245, "pearson 0": 0.8006086569834931, "spearman 0": 0.9388224988216638, "reward 0": -0.0068428716622292995, "explained_variance 1": 0.8300132900476456, "policy_entropy 1": 0.08465230464935303, "policy_loss 1": -0.008579825051128864, "value_loss 1": 0.044064126908779144, "pearson 1": 0.9197493296555507, "spearman 1": 0.9721278956855531, "reward 1": -1.3455829620361328, "total_loss 0": 0.705395519733429, "total_loss 1": 0.6521978974342346}
MA-AIRL
Expert
1. How is the expert provided to the MA-AIRL model?
2. Is the expert a policy or trajectories? What type of data is the expert?
3. Expert data is handled in dataset.py
######
simple_tag
traj_data[0].keys() = dict_keys(['ob', 'ac', 'rew', 'ep_ret', 'all_ob'])
ob: observation, ac: action, rew: reward, ep_ret: cumulative reward
num_trajs = len(traj_data) = 100 (render.py in irl)
num_agents = len(traj_data[0]['ob']) = 4
max_step = len(traj_data[0]['ob'][0]) = 50 (render.py in irl, or the done condition in simple_tag)
num_obs = len(traj_data[0]['ob'][0][0]) = 16
###
obs = np.concatenate([agent.state.p_vel] + [agent.state.p_pos] + entity_pos + other_pos + other_vel)
    = self vel (2) + self pos (2) + landmark pos (2*2) + other pos (2*3) + other vel (2*1) = 16
###
######
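A minimal sketch that checks the shapes listed above after loading the expert pkl; the file name is the example from step 3, not a fixed value:
###
import pickle

# Inspect the expert trajectory pkl described above.
with open("checkpoint01000-100tra.pkl", "rb") as f:
    traj_data = pickle.load(f)

print(traj_data[0].keys())            # dict_keys(['ob', 'ac', 'rew', 'ep_ret', 'all_ob'])
print(len(traj_data))                 # num_trajs, e.g. 100
print(len(traj_data[0]['ob']))        # num_agents, e.g. 4 for simple_tag
print(len(traj_data[0]['ob'][0]))     # max_step, e.g. 50
print(len(traj_data[0]['ob'][0][0]))  # num_obs, e.g. 16
###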
######
simple_spread
###
obs = np.concatenate([agent.state.p_vel] + entity_pos + other_pos + comm)
reward = minus the minimum distance from each landmark to the nearest agent, i.e. the agents are rewarded for minimizing the landmark-agent distances
for l in world.landmarks:
    dists = [np.sqrt(np.sum(np.square(a.state.p_pos - l.state.p_pos))) for a in world.agents]
    rew -= min(dists)
###
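A self-contained numpy version of the same reward for checking values by hand (the positions below are made-up examples):
###
import numpy as np

def spread_reward(agent_pos, landmark_pos):
    # Shared reward: for each landmark, subtract the distance to the closest agent.
    rew = 0.0
    for l in landmark_pos:
        dists = [np.sqrt(np.sum(np.square(a - l))) for a in agent_pos]
        rew -= min(dists)
    return rew

agents = [np.array([0.0, 0.0]), np.array([1.0, 1.0]), np.array([-1.0, 0.5])]
landmarks = [np.array([0.1, 0.0]), np.array([1.0, 0.9]), np.array([-1.0, -1.0])]
print(spread_reward(agents, landmarks))  # closer coverage -> reward closer to 0
###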
######
Envs
1. How to access the environment.
The environment is created by the make_model function -> it should be easy to add a Maze env to
multi-agent-particle-envs.
2. Action
action_space = dim_p*2+1 = 5 (one-hot; see the sketch after this list)
[1,0,0,0,0] = Stay
[0,1,0,0,0] = Right
[0,0,1,0,0] = Left
[0,0,0,1,0] = Up
[0,0,0,0,1] = Down
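A minimal sketch of decoding these one-hot actions, with the index order taken from the table above (dim_p = 2 is assumed, as in multi-agent-particle-envs):
###
import numpy as np

ACTION_NAMES = ["Stay", "Right", "Left", "Up", "Down"]

def decode_action(one_hot):
    # Map a 5-dim one-hot movement action to its direction name.
    return ACTION_NAMES[int(np.argmax(one_hot))]

print(decode_action([0, 1, 0, 0, 0]))  # Right
print(decode_action([0, 0, 0, 0, 1]))  # Down
###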
Learning
Policy
1. CategoricalPolicy
2. GaussianPolicy
3. MultiCategoricalPolicy
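For the discrete particle envs, a categorical policy samples from a softmax over the 5 movement actions. A standalone sketch of just that sampling step (the logits are arbitrary; the real policy classes in the repo differ in structure):
###
import numpy as np

def sample_categorical(logits, rng):
    # Softmax over the action logits, then sample one action index.
    probs = np.exp(logits - np.max(logits))
    probs /= probs.sum()
    return rng.choice(len(logits), p=probs)

rng = np.random.default_rng(0)
logits = np.array([0.1, 2.0, -1.0, 0.5, 0.0])  # 5 movement actions
action = sample_categorical(logits, rng)
print(action, np.eye(5)[action])  # sampled index and its one-hot encoding
###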
Memo
irl/mack/airl.py
555: runner starts
362: model.step
286: action is taken from the policy (stored in step_model at line 74)
373: env.step
489: return mb_obs, mb_obs_next, mb_states, mb_returns, mb_report_returns, mb_masks, mb_actions, \
mb_values, mb_all_obs, mb_all_nobs, mh_actions, mh_all_actions, mb_rewards, mb_true_rewards, mb_true_returns
588: discriminator training starts (irl/mack/kfac_discriminator_airl)
66: reward = relu(tf.concat([obs, act]))
81: loss = -tf.reduce_mean(labels*(log_p_tau - log_pq) + (1 - labels) * (log_q_tau - log_pq)) (maximizes the probability of classifying expert samples as expert and agent samples as agent)
76: log_q_tau = lprobs
77: log_p_tau = reward(obs) + gamma*V(n_obs) - V(obs) = relu(obs) + gamma * relu(nobs) - relu(obs) (difference against the predicted state values)
78: log_pq = tf.reduce_logsumexp([log_p_tau, log_q_tau]) (log-sum-exp of the two terms for each row)
84: loss += tf.add_n([tf.nn.l2_loss(v) for v in params]) * ratio (L2 regularization)
101: optim = tf.train.AdamOptimizer().apply_gradients(zip(grads, params))
83: params = find_trainable_variables() (rl/acktr/utils)
87: grads = tf.gradients(loss, params)
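Putting the notes on lines 76-81 together, a minimal numpy sketch of the discriminator loss (the batch values are random placeholders; the real code builds this in TensorFlow inside kfac_discriminator_airl):
###
import numpy as np

rng = np.random.default_rng(0)
batch, gamma = 8, 0.99

# Placeholder network outputs: learned reward r(s,a), value estimates V(s) and V(s'),
# policy log-probabilities, and labels (1 = expert sample, 0 = agent sample).
reward = rng.normal(size=batch)
v_obs, v_nobs = rng.normal(size=batch), rng.normal(size=batch)
lprobs = rng.normal(size=batch) - 2.0
labels = (rng.random(batch) < 0.5).astype(float)

log_q_tau = lprobs
log_p_tau = reward + gamma * v_nobs - v_obs
log_pq = np.logaddexp(log_p_tau, log_q_tau)  # log-sum-exp of the two terms
loss = -np.mean(labels * (log_p_tau - log_pq) + (1 - labels) * (log_q_tau - log_pq))
print(loss)
###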
182: generator training starts (irl/mack/airl.py)
148: optimization function for the policy (generator)
# I don't know what the processing at line 188 does.
188~: new_map.update({
          train_model[k].X: np.concatenate([obs[j] for j in range(k, pointer[k])], axis=0),
          train_model[k].X_v: np.concatenate([ob.copy() for j in range(k, pointer[k])], axis=0),
          A[k]: np.concatenate([actions[j] for j in range(k, pointer[k])], axis=0),
          ???ADV[k]: np.concatenate([advs[j] for j in range(k, pointer[k])], axis=0),
          ???R[k]: np.concatenate([rewards[j] for j in range(k, pointer[k])], axis=0),
          ???PG_LR[k]: cur_lr / float(scale[k])
      })
      sess.run(train_ops[k], feed_dict=new_map)
      td_map.update(new_map)
      policy_loss, value_loss, policy_entropy = sess.run(
          [pg_loss, vf_loss, entropy],
          td_map
      )
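A minimal numpy sketch of what the ADV and R entries above typically carry in this style of actor-critic update (discounted returns, and advantages = returns minus the value baseline); this is the standard A2C/ACKTR pattern, not a verified reading of line 188:
###
import numpy as np

# Placeholder rollout data for one agent k (made-up values).
gamma = 0.99
step_rewards = np.array([0.1, -0.2, 0.05, 0.3])
values = np.array([0.0, 0.1, -0.1, 0.2])  # critic estimates for the same steps

# Discounted returns, accumulated backwards over the rollout.
returns = np.zeros_like(step_rewards)
running = 0.0
for t in reversed(range(len(step_rewards))):
    running = step_rewards[t] + gamma * running
    returns[t] = running

# Advantages = returns minus the value baseline.
advantages = returns - values
print(returns)
print(advantages)
###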