My understanding is that even if we start from the utilities of the original policy, our policy choices will still change as more information becomes available.
For example, suppose you start with a 2x2 grid with a step cost of 1 and a bottom-right terminal state worth +10, and the original policy is to always move down. We would start with a policy of:
D D
D -
and utilities of
-1 9
0 +10
Then in the second iteration, our policy changes based on the new utilities that we have for each state. At each iterative step, you check each state to see whether there is a possible move that would now result in a higher utility. The policy becomes:
-> D
-> -
and utilities of
8 9
9 +10
From here, since this is the best move that each state can make, this is the best policy and the iteration is complete. Even though these are the correct final utilities in this case, they don't have to be for policy iteration to be complete; it simply stops at the step where none of the actions in the policy change.
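To make the steps above concrete, here is a rough Python sketch of one improvement step on this grid, followed by the "did anything change?" check. It is only an illustration under my own assumptions: moves are deterministic, bumping into a wall leaves you where you are, a state keeps its current action unless another one is strictly better, and the starting utilities are taken as given from the example. All of the names (step, improve, evaluate) are made up for this sketch.

```python
# States are (row, col) on the 2x2 grid; (1, 1) is the +10 terminal.
TERMINAL = (1, 1)
TERMINAL_UTILITY = 10
STEP_COST = 1
MOVES = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}
STATES = [(0, 0), (0, 1), (1, 0)]  # the non-terminal states


def step(state, action):
    """Deterministic move; hitting a wall leaves the state unchanged (assumption)."""
    r = state[0] + MOVES[action][0]
    c = state[1] + MOVES[action][1]
    return (r, c) if 0 <= r < 2 and 0 <= c < 2 else state


def improve(utilities, policy):
    """For each state, switch action only if another one is strictly better."""
    new_policy = {}
    for s in STATES:
        best = policy[s]
        for a in MOVES:
            # One-step lookahead: pay the step cost, land in the successor.
            if utilities[step(s, a)] - STEP_COST > utilities[step(s, best)] - STEP_COST:
                best = a
        new_policy[s] = best
    return new_policy


def evaluate(policy, max_steps=10):
    """Follow the policy to the terminal and subtract the step costs paid."""
    utilities = {TERMINAL: TERMINAL_UTILITY}
    for s in STATES:
        cur, steps = s, 0
        while cur != TERMINAL:
            if steps >= max_steps:
                raise ValueError(f"policy never reaches the terminal from {s}")
            cur = step(cur, policy[cur])
            steps += 1
        utilities[s] = TERMINAL_UTILITY - steps * STEP_COST
    return utilities


# Starting point copied from the example: always move down.
initial_policy = {(0, 0): "D", (0, 1): "D", (1, 0): "D"}
initial_utilities = {(0, 0): -1, (0, 1): 9, (1, 0): 0, (1, 1): 10}

new_policy = improve(initial_utilities, initial_policy)
new_utilities = evaluate(new_policy)

# Policy iteration stops once an improvement step changes no action.
stable = improve(new_utilities, new_policy) == new_policy
print(new_policy, new_utilities, stable)
```

With these assumptions the improvement step switches the two left-hand states to "R" (the -> in the grid above), the utilities come out as 8, 9, 9, +10, and the second improvement step changes nothing, so the loop would stop there. The strict comparison matters: from the top-left state, moving down and moving right are tied at that point, and without it the policy could keep flipping between equally good actions.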
This is how I understand policy iteration; hopefully someone can confirm whether this is correct.