Hello, my agent usually ranks first, but it sometimes comes second. When I investigated the cause, I found this leaderboard:
LEADERBOARD AFTER 1000 GAMES
Resistance Wins: 268, Spy Wins: 732, Resistance Win Rate: 0.2680
1: SatisfactoryAgent | win_rate=0.4847 res_win_rate=0.2744 spy_win_rate=0.8027 | errors=0 games=1859 wins=901 losses=958 res=1119 spy=740 res_wins=307 res_losses=812 spy_wins=594 spy_losses=146
2: DumbAgent | win_rate=0.4762 res_win_rate=0.2809 spy_win_rate=0.8253 | errors=0 games=1915 wins=912 losses=1003 res=1228 spy=687 res_wins=345 res_losses=883 spy_wins=567 spy_losses=120
The 3rd and 4th places are irrelevant.
If you look at res_win_rate and spy_win_rate, both of my agent's (DumbAgent's) individual rates are higher than SatisfactoryAgent's. I still lost on overall win_rate because my agent played 1915 games while SatisfactoryAgent played only 1859.
I would like to argue that my agent still beat SatisfactoryAgent because if you play more games, you will obviously lose more.
Just in case, here is the leaderboard from the next tournament:
LEADERBOARD AFTER 1000 GAMES
Resistance Wins: 272, Spy Wins: 728, Resistance Win Rate: 0.2720
1: DumbAgent | win_rate=0.5095 res_win_rate=0.3138 spy_win_rate=0.8441 | errors=0 games=1843 wins=939 losses=904 res=1163 spy=680 res_wins=365 res_losses=798 spy_wins=574 spy_losses=106
2: SatisfactoryAgent | win_rate=0.4739 res_win_rate=0.2693 spy_win_rate=0.7958 | errors=0 games=1840 wins=872 losses=968 res=1125 spy=715 res_wins=303 res_losses=822 spy_wins=569 spy_losses=146
In this leaderboard, the agents differ by only 3 games played, and my agent still won by about 3.6 percentage points.
I would like Mr. Andrew's view on this. Will you treat the difference in the number of games played as a significant factor when evaluating?
Remember that the testing code provided is just a simple tournament that boils performance down to a single empirically measured number.
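In this case, what seems to have happened is that SatisfactoryAgent got lucky and played an anomalously high proportion of its games as spy (740 of 1859, about 39.8%), while yours played an anomalously low proportion (687 of 1915, about 35.9%). Spies win far more often than the resistance in these tournaments, so a spy-heavy role mix raises the overall win rate even when both per-role rates are lower. This is the nature of empirical statistical measures: with enough samples they should converge on the mean. If your agent is good, the probability of it losing a tournament should be very low.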
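To make the role-mix effect concrete, here is a minimal Python sketch using the counts from the first leaderboard above. The `common_mix` figure, which scores both agents at the same pooled spy share, is an illustrative assumption and not the method used by the tournament code or the assessment.

```python
# Counts copied from the first leaderboard above.
counts = {
    "SatisfactoryAgent": {"res": 1119, "spy": 740, "res_wins": 307, "spy_wins": 594},
    "DumbAgent": {"res": 1228, "spy": 687, "res_wins": 345, "spy_wins": 567},
}

# Pooled share of games played as spy across both agents (illustrative choice).
total_spy = sum(c["spy"] for c in counts.values())
total_games = sum(c["res"] + c["spy"] for c in counts.values())
pooled_spy_share = total_spy / total_games


def summarize(c, spy_share):
    """Return the agent's spy share, raw overall rate, and rate at a common role mix."""
    games = c["res"] + c["spy"]
    res_rate = c["res_wins"] / c["res"]
    spy_rate = c["spy_wins"] / c["spy"]
    return {
        "spy_share": c["spy"] / games,
        "overall": (c["res_wins"] + c["spy_wins"]) / games,
        # Hypothetical adjustment: score both agents as if they had the same role mix.
        "common_mix": (1 - spy_share) * res_rate + spy_share * spy_rate,
    }


summary = {name: summarize(c, pooled_spy_share) for name, c in counts.items()}
for name, s in summary.items():
    print(f"{name}: spy_share={s['spy_share']:.3f} "
          f"overall={s['overall']:.4f} common_mix={s['common_mix']:.4f}")
```

With these counts, SatisfactoryAgent is ahead on the raw overall rate, but DumbAgent is ahead once both are scored at the same role mix.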
The tournament code is simpler than the tests we plan to run for assessment. Recall that part of the assignment is to assess the effectiveness of your agent yourself and report on it.
We will do our best to accurately assess your agent's performance against the benchmarks, and if your agent consistently outperforms the benchmarks to a statistically significant degree, this should be reflected in the result. As with any statistical system, there is a chance of a false positive or false negative. I expect false negatives in the marking system to be extremely unlikely unless your agent is so close in capability to the benchmark that it may not be meaningful to say it outperforms it. That would be a true negative, not a false negative.
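One way to check for yourself whether a gap in win rate is statistically significant is a two-proportion z-test. The sketch below is only an illustration: it is not the assessment procedure, and it assumes the two agents' games are independent samples, which is not strictly true when both agents play in the same games. It also ignores the role mix.

```python
import math


def two_proportion_z(wins_a, games_a, wins_b, games_b):
    """Two-sided z-test for a difference between two win rates (pooled variance)."""
    p_a = wins_a / games_a
    p_b = wins_b / games_b
    pooled = (wins_a + wins_b) / (games_a + games_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / games_a + 1 / games_b))
    z = (p_a - p_b) / se
    # Two-sided p-value under a standard normal approximation.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# wins and games copied from the two leaderboards in this thread.
first = two_proportion_z(901, 1859, 912, 1915)   # SatisfactoryAgent vs DumbAgent
second = two_proportion_z(939, 1843, 872, 1840)  # DumbAgent vs SatisfactoryAgent

print(f"first tournament:  z={first[0]:.2f} p={first[1]:.3f}")
print(f"second tournament: z={second[0]:.2f} p={second[1]:.3f}")
```

Under these assumptions, the gap in the first leaderboard is well within noise (roughly half a standard error), while the gap in the second is a little over two standard errors.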
For example, even with this random variation, I assume you have never seen the RandomAgent outperform your agent? It is technically possible, but statistically unlikely enough that we can assume we will never see it. You should be aiming to make the best agent you can, and it is definitely possible to be enough better than the SatisfactoryAgent that we would expect to see the same effect. The SatisfactoryAgent is deliberately a fairly low bar, so that clearing it by a clear, statistically significant margin is easy.
So all that is to say: I expect that if you are losing any appreciable fraction of tournaments due to bad luck, then your agent probably can't be considered to consistently outperform the benchmark. It should be extremely unlikely for a good agent to lose a tournament. This is demonstrated by the fact that the agents we made while preparing this project have never lost a tournament against any of the benchmarks in our testing.
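A rough way to see how the margin over the benchmark affects the chance of an unlucky loss is to simulate many tournaments. This is a sketch only: the per-role win rates, the 0.38 spy probability, and the 1850 games per agent are hypothetical values loosely based on the leaderboards in this thread, and each agent's games are simulated independently, which the real tournament does not do.

```python
import random

# Hypothetical "true" per-role win rates, chosen only for illustration.
BENCHMARK = {"res": 0.27, "spy": 0.80}
SLIGHTLY_BETTER = {"res": 0.28, "spy": 0.83}
CLEARLY_BETTER = {"res": 0.40, "spy": 0.90}


def overall_rate(rates, games, spy_prob, rng):
    """Simulate one agent's overall win rate with randomly assigned roles."""
    wins = 0
    for _ in range(games):
        role = "spy" if rng.random() < spy_prob else "res"
        if rng.random() < rates[role]:
            wins += 1
    return wins / games


def loss_fraction(agent, benchmark, games=1850, spy_prob=0.38,
                  tournaments=300, seed=0):
    """Fraction of simulated tournaments in which `agent` ranks below `benchmark`."""
    rng = random.Random(seed)
    losses = 0
    for _ in range(tournaments):
        a = overall_rate(agent, games, spy_prob, rng)
        b = overall_rate(benchmark, games, spy_prob, rng)
        if a < b:
            losses += 1
    return losses / tournaments


slight = loss_fraction(SLIGHTLY_BETTER, BENCHMARK)
clear = loss_fraction(CLEARLY_BETTER, BENCHMARK)
print(f"slightly better agent loses {slight:.1%} of simulated tournaments")
print(f"clearly better agent loses {clear:.1%} of simulated tournaments")
```

In this toy model, an agent only marginally ahead of the benchmark loses a noticeable share of tournaments to luck, while one that is clearly ahead essentially never does.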
Hope that helps.
Cheers,
Gozz