For the environment shown in Figure 17.1, find all the threshold values for R(s) such that the optimal policy changes when the threshold is crossed. You will need a way to calculate the optimal policy and its value for fixed R(s).
Figure 17.1
One useful observation in this context is that the expected total reward of any fixed policy is linear in r, the per-step reward for the empty states. Imagine drawing the total reward of a policy as a function of r: a straight line. Now draw all the straight lines corresponding to all possible policies. The reward of the optimal policy as a function of r is just the max of all these straight lines. Therefore it is a piecewise linear, convex function of r.
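To see why (the symbols below are our own notation, not the book's): for a fixed policy π, the expected total reward from the start state is

$$V^{\pi}(r) \;=\; \mathbb{E}\Big[\sum_{t=0}^{T} R(s_t)\Big] \;=\; c_{\pi} + n_{\pi}\,r,$$

where n_π is the expected number of steps spent in empty squares before reaching a terminal state and c_π is the expected terminal payoff. Both depend only on π, so V^π is a line in r, and the optimal value max_π V^π(r) is the upper envelope of finitely many such lines.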
Hence there is a very efficient way to find all the optimal policy regions: solve the MDP at the two ends of the range of r under consideration, giving two policies and their two value lines; find the value of r where the two lines intersect, and solve the MDP there; if the optimal policy at the intersection is one of the two already found, the intersection point is the boundary between their regions; otherwise a new policy (and a new line) has been discovered, and the same test is applied recursively on either side of the intersection.
The policies and boundaries derived from this procedure are shown in Figure S17.1. The figure shows that there are nine distinct optimal policies! Notice that as r becomes more negative, the agent becomes more willing to dive straight into the –1 terminal state rather than face the cost of the detour to the +1 state.
The code is somewhat ugly, and some care is required: because the lines for neighboring policies are very nearly parallel, numerical instability is a serious problem when computing their intersections.
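A minimal Python sketch of such a solver follows, assuming the standard 4x3 world of Figure 17.1 (wall at (2,2), terminals +1 at (4,3) and -1 at (4,2), the 0.8/0.1/0.1 transition model, no discounting, which restricts the search to r < 0 where undiscounted value iteration converges). To sidestep the instability of intersecting nearly parallel lines, this sketch instead bisects on the policy itself; the identifiers (solve, boundaries) and the search interval [-2, 0) are our assumptions, not the original code.

from functools import lru_cache

# A sketch, not the original code: the 4x3 world of Figure 17.1.
WALL = (2, 2)
TERMINALS = {(4, 3): 1.0, (4, 2): -1.0}   # the +1 and -1 squares
STATES = [(x, y) for x in range(1, 5) for y in range(1, 4) if (x, y) != WALL]
ACTIONS = {'U': (0, 1), 'D': (0, -1), 'L': (-1, 0), 'R': (1, 0)}
LEFT_OF = {'U': 'L', 'L': 'D', 'D': 'R', 'R': 'U'}    # 90 degrees left
RIGHT_OF = {v: k for k, v in LEFT_OF.items()}         # 90 degrees right

def move(s, a):
    # Deterministic effect of action a in state s; bumps leave s unchanged.
    n = (s[0] + ACTIONS[a][0], s[1] + ACTIONS[a][1])
    return s if n == WALL or not (1 <= n[0] <= 4 and 1 <= n[1] <= 3) else n

def transitions(s, a):
    # 0.8 intended direction, 0.1 each perpendicular direction.
    return [(0.8, move(s, a)),
            (0.1, move(s, LEFT_OF[a])),
            (0.1, move(s, RIGHT_OF[a]))]

@lru_cache(maxsize=None)
def solve(r, eps=1e-10):
    # Undiscounted value iteration; converges for r < 0 because every
    # policy that never terminates has unboundedly negative value.
    U = {s: 0.0 for s in STATES}
    while True:
        delta = 0.0
        for s in STATES:
            if s in TERMINALS:
                new = TERMINALS[s]
            else:
                new = r + max(sum(p * U[t] for p, t in transitions(s, a))
                              for a in ACTIONS)
            delta = max(delta, abs(new - U[s]))
            U[s] = new
        if delta < eps:
            break
    # Return the greedy policy in a hashable, comparable form.
    def best(s):
        return max(ACTIONS, key=lambda a: sum(p * U[t] for p, t in transitions(s, a)))
    return tuple(sorted((s, best(s)) for s in STATES if s not in TERMINALS))

def boundaries(lo, hi, tol=1e-5):
    # By convexity each policy is optimal on a single interval of r,
    # so equal policies at the endpoints mean no boundary in between.
    if solve(lo) == solve(hi):
        return []
    mid = (lo + hi) / 2.0
    if hi - lo < tol:
        return [mid]
    return boundaries(lo, mid, tol) + boundaries(mid, hi, tol)

print([round(b, 4) for b in boundaries(-2.0, -1e-4)])

Tightening tol localizes each threshold more precisely at the cost of extra MDP solves; the printed values can then be checked against the boundaries shown in Figure S17.1.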
Figure S17.1