In: Computer Science
Is there a systematic way to determine which action value-based learning method (Q-learning and SARSA) is a better choice and can achieve better results? Explain.
SARSA and Q-learning are both action value-based reinforcement learning algorithms that work in a very similar way. The major difference is that SARSA is on-policy while Q-learning is off-policy.
The update rules for SARSA and Q-learning are:

SARSA: $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma\, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right]$

Q-learning: $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma\, \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right]$
In both SARSA and Q-learning, the agent actually takes a single concrete action at the next step. The difference lies in the update target: Q-learning updates its estimate using the maximum over the estimates of the possible next actions, regardless of which action is actually taken, while SARSA updates its estimate using the action that is actually taken.
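To make this concrete, below is a minimal sketch of one tabular update step for each algorithm. The array `Q` indexed by `[state, action]` and the names `alpha`, `gamma`, `s`, `a`, `r`, `s_next`, `a_next` are illustrative assumptions, not part of any particular library.

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    # SARSA (on-policy): the target uses the action a_next that the
    # behaviour policy actually selected in s_next.
    target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    # Q-learning (off-policy): the target uses the greedy (maximum)
    # value in s_next, regardless of which action is actually taken next.
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
```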
There is a systematic way to determine which action value-based learning method (Q-learning or SARSA) is the better choice and can achieve better results: we can compare the two algorithms along several dimensions and decide based on the properties listed below.
1. Q-learning directly learns the optimal policy, while SARSA learns a near-optimal policy while it is still exploring. To learn an optimal policy with SARSA, we also need to decide how to decay ε in the ε-greedy action choice (see the sketch after this list).
2. If there is a large negative reward close to the optimal path, Q-learning will tend to trigger that reward while exploring, whereas SARSA will avoid the dangerous optimal path and only slowly learn to use it once the exploration parameter has been reduced. The classic cliff-walking example illustrates this: Q-learning learns the path along the cliff edge and occasionally falls off during exploration, while SARSA learns a longer but safer route.
3. Q-learning has higher per-sample variance than SARSA and, as a result, may have more trouble converging. This becomes a particular problem when the Q-function is approximated with a neural network rather than a table.
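For point 1, here is a minimal sketch of an ε-greedy action choice with a decaying ε. The multiplicative decay schedule with a floor, and the names `epsilon`, `epsilon_min`, and `decay`, are assumptions; other schedules (e.g. linear decay) work as well.

```python
import numpy as np

def epsilon_greedy(Q, s, epsilon, rng=np.random.default_rng()):
    # With probability epsilon explore uniformly over actions,
    # otherwise act greedily with respect to the current estimates.
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[s]))

# Example decay schedule: shrink epsilon after each episode down to a floor.
epsilon, epsilon_min, decay = 1.0, 0.05, 0.995
for episode in range(1000):
    # ... run one episode, choosing actions with epsilon_greedy(Q, s, epsilon) ...
    epsilon = max(epsilon_min, epsilon * decay)
```

With a schedule like this, SARSA's behaviour policy gradually approaches the greedy policy, which is what lets it converge toward the optimal policy rather than just a safe near-optimal one.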