Consider the 3 × 3 world shown in Figure 17.14(a). The transition model is the same as in the 4 × 3 world of Figure 17.1: 80% of the time the agent goes in the direction it selects; the rest of the time it moves at right angles to the intended direction.
Implement value iteration for this world for each value of r below. Use discounted rewards with a discount factor of 0.99. Show the policy obtained in each case. Explain intuitively why the value of r leads to each policy.
a. r = 100
b. r = −3
c. r = 0
d. r = +3
[Figure 17.1]
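Since the exercise asks for an implementation, here is a minimal value-iteration sketch in Python. It assumes the layout of Figure 17.14(a) as described in the answers below: states are (column, row) pairs with (1, 1) at the lower left, the goal is a terminal square at (3, 3) (assumed here to carry a reward of +10, since the figure is not reproduced), square (1, 3) carries the reward r, and every other square carries −1. The helper names and convergence threshold are illustrative, not part of the exercise.

GAMMA = 0.99
ACTIONS = {'Up': (0, 1), 'Down': (0, -1), 'Left': (-1, 0), 'Right': (1, 0)}
PERP = {'Up': ('Left', 'Right'), 'Down': ('Left', 'Right'),
        'Left': ('Up', 'Down'), 'Right': ('Up', 'Down')}
STATES = [(col, row) for col in (1, 2, 3) for row in (1, 2, 3)]
GOAL = (3, 3)

def reward(state, r):
    if state == GOAL:
        return 10                  # assumed terminal reward from Figure 17.14(a)
    return r if state == (1, 3) else -1

def move(state, action):
    # Deterministic effect of an action; bumping into a wall stays put.
    dc, dr = ACTIONS[action]
    col, row = state[0] + dc, state[1] + dr
    return (col, row) if 1 <= col <= 3 and 1 <= row <= 3 else state

def transitions(state, action):
    # (probability, next_state) pairs for the 80/10/10 transition model.
    side1, side2 = PERP[action]
    return [(0.8, move(state, action)),
            (0.1, move(state, side1)),
            (0.1, move(state, side2))]

def value_iteration(r, eps=1e-6):
    U = {s: 0.0 for s in STATES}
    while True:
        U_new, delta = {}, 0.0
        for s in STATES:
            if s == GOAL:
                U_new[s] = reward(s, r)      # terminal state: no further transitions
            else:
                best = max(sum(p * U[s2] for p, s2 in transitions(s, a))
                           for a in ACTIONS)
                U_new[s] = reward(s, r) + GAMMA * best
            delta = max(delta, abs(U_new[s] - U[s]))
        U = U_new
        if delta < eps * (1 - GAMMA) / GAMMA:
            break
    # Greedy policy with respect to the converged utilities.
    policy = {s: max(ACTIONS, key=lambda a: sum(p * U[s2]
                                                for p, s2 in transitions(s, a)))
              for s in STATES if s != GOAL}
    return U, policy

if __name__ == '__main__':
    for r in (100, -3, 0, 3):
        _, policy = value_iteration(r)
        print('r =', r)
        for row in (3, 2, 1):                # print the top row of the grid first
            print('   ', [policy.get((col, row), 'goal') for col in (1, 2, 3)])

Running the sketch for each of the four values of r should reproduce the policies discussed in parts a–d.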
a.
r = 100.
See the comments for part d. This value should have been r = −100 to illustrate an alternative behavior, so the policy described here is the one obtained for r = −100:
Here the agent tries to reach the goal quickly while avoiding square (1, 3) as much as possible. Note that the agent chooses to move Down in square (1, 2): if it moved Right instead, there would be a 10% chance of "accidentally" slipping into square (1, 3), and the penalty for entering that square is so great that the agent avoids even that small risk.
b.
r = −3.
Here, the agent again tries to reach the goal as quickly as possible while preferring to avoid square (1, 3), but the penalty for that square is not so great that the agent avoids it at all costs. Thus, the agent chooses to move Right in square (1, 2) to get closer to the goal, even though this occasionally results in a transition into square (1, 3).
c.
r = 0.
Here, the agent again tries to reach the goal as quickly as possible, but prefers a path that passes through square (1, 3) when one is available. This is because square (1, 3) does not incur the −1 reward that every other non-goal square does, so reaching the goal via a path through that square yields slightly greater reward than a path of equal length that avoids it.
d.
r = +3.
Here, the optimal policy never reaches the goal at all. With a discount factor of 0.99, an agent that stayed in square (1, 3) indefinitely would collect a discounted reward of up to 3/(1 − 0.99) = 300, which far exceeds anything obtainable by heading for the terminal square. The agent therefore makes its way to square (1, 3) and then keeps choosing an action that presses into the corner (for example, Up or Left), so that it remains in (1, 3) with high probability; even though it slips out of the corner 10% of the time, loitering there is still far more valuable than reaching the goal. The same behavior, only more extreme, is produced by r = 100 in part a, which is why that value should presumably have been −100.
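As a quick sanity check on the arithmetic above (an illustrative calculation, assuming the goal square in Figure 17.14(a) is worth +10), the discounted return of an agent that never left square (1, 3) is a geometric series:

gamma = 0.99
r = 3
stay_in_corner = r / (1 - gamma)   # sum of r * gamma**t over t = 0, 1, 2, ... = 300
print(stay_in_corner)              # 300.0, which dwarfs the +10 available at the goal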