ANONYMOUS wrote:
> With policy iteration, my understanding is that:
> - We compute the utilities of a policy
> - Then we compute a new policy according to the utilities we just calculated
> - We repeat this until the policy converges
>
> What I don't understand is why the policy improves. If we start with policy P and determine the set of utilities U for each state, then, using U, wouldn't we just get back the same policy P?
Hi Anon,
With policy iteration,
- We use value determination to compute the utility of each state, assuming the agent follows the current policy.
- We then use action determination to pick, in each state, the action with the highest expected utility given those utilities; this gives us a new policy.
- We then switch to this new policy and repeat the process until the policy converges.
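To make the value-determination step concrete, here's a rough Python sketch. Everything about the MDP is made up for illustration: the two states, the actions "left" and "right", the transition table P, the rewards R, and the discount GAMMA are all hypothetical, and I approximate the utilities with repeated Bellman sweeps rather than solving the linear system exactly.

```python
GAMMA = 0.9  # hypothetical discount factor

# Hypothetical 2-state MDP: P[s][a] is a list of (probability, next_state)
# pairs, and R[s] is the reward for being in state s.
P = {
    0: {"left": [(1.0, 0)], "right": [(0.8, 1), (0.2, 0)]},
    1: {"left": [(1.0, 0)], "right": [(1.0, 1)]},
}
R = {0: 0.0, 1: 1.0}


def value_determination(policy, sweeps=100):
    """Utility of each state if the agent follows `policy` forever."""
    U = {s: 0.0 for s in P}
    for _ in range(sweeps):
        # Bellman update, restricted to the single action the policy prescribes.
        U = {s: R[s] + GAMMA * sum(p * U[s2] for p, s2 in P[s][policy[s]])
             for s in P}
    return U


# Utilities of an arbitrary initial policy that always moves "left".
print(value_determination({0: "left", 1: "left"}))  # roughly {0: 0.0, 1: 1.0}
```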
The reason the policy improves is that the initial policy is chosen arbitrarily, without any knowledge of utilities. Once we compute the utilities of following that initial policy, action determination can check, for each state, whether some other action has a higher expected utility than the action the current policy prescribes. If it does, the new policy switches to that action, so the new policy is at least as good as the old one and usually strictly better. We only get the exact same policy back when no state can be improved, and at that point the policy is already optimal, which is precisely the convergence condition. Otherwise, we adopt the new policy and repeat the process to see if we can improve it further.
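Here's a sketch of the whole loop on the same made-up MDP (repeated so the snippet runs on its own; `action_determination` is just an illustrative name for the greedy step). After evaluating the arbitrary "always left" policy, action determination already prefers "right" in both states, because the utilities reveal that state 1 is worth reaching, so we don't get the same policy back. We only get the same policy back once it can't be improved, and that's exactly when the loop stops.

```python
GAMMA = 0.9  # same hypothetical MDP as in the previous snippet
P = {
    0: {"left": [(1.0, 0)], "right": [(0.8, 1), (0.2, 0)]},
    1: {"left": [(1.0, 0)], "right": [(1.0, 1)]},
}
R = {0: 0.0, 1: 1.0}


def value_determination(policy, sweeps=100):
    U = {s: 0.0 for s in P}
    for _ in range(sweeps):
        U = {s: R[s] + GAMMA * sum(p * U[s2] for p, s2 in P[s][policy[s]])
             for s in P}
    return U


def action_determination(U):
    # For each state, pick the action with the highest expected utility under U.
    return {s: max(P[s], key=lambda a: sum(p * U[s2] for p, s2 in P[s][a]))
            for s in P}


policy = {0: "left", 1: "left"}          # arbitrary starting policy
while True:
    U = value_determination(policy)
    new_policy = action_determination(U)
    print(policy, "->", new_policy)      # watch the policy change, then settle
    if new_policy == policy:             # same policy back => converged, optimal
        break
    policy = new_policy
```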
I hope that helps.