My agents have a return-home action, which automatically makes them follow the cheapest return path. That way they don’t need to compute it themselves, which would probably be rather hard to learn.
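For reference, the cheapest return path can be found with a plain Dijkstra over move costs. Below is a minimal sketch of that idea, not my actual code; it assumes Halite III’s rule that leaving a cell costs 10% of the halite on it, a `halite[y][x]` grid, and wrap-around map edges.

```python
import heapq

def cheapest_return_path(halite, start, dropoffs):
    """Dijkstra from `start` to the nearest dropoff on a wrapping grid.

    Assumes Halite III costs: leaving a cell burns halite // 10.
    Returns the list of cells along the cheapest path (start inclusive).
    """
    h, w = len(halite), len(halite[0])
    dist = {start: 0}
    prev = {}
    heap = [(0, start)]
    targets = set(dropoffs)
    while heap:
        cost, cell = heapq.heappop(heap)
        if cell in targets:
            # Reconstruct the path by walking predecessors back to start.
            path = [cell]
            while cell in prev:
                cell = prev[cell]
                path.append(cell)
            return path[::-1]
        if cost > dist.get(cell, float("inf")):
            continue  # stale heap entry
        y, x = cell
        step_cost = halite[y][x] // 10  # cost of leaving the current cell
        for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = ((y + dy) % h, (x + dx) % w)
            ncost = cost + step_cost
            if ncost < dist.get(nxt, float("inf")):
                dist[nxt] = ncost
                prev[nxt] = cell
                heapq.heappush(heap, (ncost, nxt))
    return [start]  # no dropoff reachable (shouldn’t happen on a torus)
```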
I briefly experimented with DQN, but I wasn’t able to make it work - I suspect this game might be hard for deterministic policies. My PPO agents seem to fall into deadlocks once they become too sure of their decisions and the randomness of their actions decreases. It would be awesome to make DQN work, though - I really like the principle of off-policy algorithms.
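The usual knob for fighting that entropy collapse in PPO is the entropy bonus in the loss. Here is a rough sketch of the standard clipped objective with such a bonus (generic PPO machinery in PyTorch, not my exact setup):

```python
import torch

def ppo_loss(new_logp, old_logp, adv, entropy,
             clip_eps=0.2, ent_coef=0.01):
    """Clipped PPO objective with an entropy bonus.

    A larger `ent_coef` keeps the policy more random for longer,
    which can help against the deadlocks described above.
    """
    ratio = torch.exp(new_logp - old_logp)
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    policy_loss = -torch.min(ratio * adv, clipped * adv).mean()
    return policy_loss - ent_coef * entropy.mean()
```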
Well, I definitely wasn’t expecting that kind of performance. If I recall correctly, the differences between my low gold bot and my high platinum bot are: building dropoffs (using a very dumb scripted strategy), using a clever trick that makes ships more willing to kill each other, using a slightly different network architecture, and a more reliable movement solver. The core idea and core code are still the same.
The problem is that I can’t directly improve the strategy I am using, because I have no idea what strategy I am actually using. The network controls all ship movements, and I can only guess what it is doing. It clearly understands inspiration; on the other hand, it most probably does not fully understand combat.
About choosing the reward function: for 4-player games my reward is still simply the halite delivered by a ship to a dropoff. For 2-player games the reward is the same, except that when ships collide they are rewarded using the trick mentioned earlier, which I won’t disclose for now.
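For the 4-player case the reward is easy to sketch. Something like the per-ship signal below (my real code differs, the 2-player collision term is deliberately left out, and the data structures here are made up for illustration):

```python
def ship_rewards(ships, prev_cargo):
    """Per-ship reward for 4-player games: halite delivered this turn.

    `ships` maps ship id -> (cargo, on_dropoff); `prev_cargo` maps
    ship id -> cargo on the previous turn. Both are assumed shapes,
    not the real bot's data structures.
    """
    rewards = {}
    for sid, (cargo, on_dropoff) in ships.items():
        delivered = prev_cargo.get(sid, 0) - cargo if on_dropoff else 0
        rewards[sid] = max(delivered, 0)  # reward only actual deposits
    return rewards
```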
Outputs for cells without ships are ignored (and masked when computing the loss).
Is that during the training phase, evaluation, or both?
They are ignored during game generation, because I calculate the policy for all map cells and then only actually use the cells on which there are ships. During training they are also ignored, so the network is not trained to do something meaningless.
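In code the masking boils down to something like this (a simplified policy-gradient loss in PyTorch; my bot uses PPO, whose clipping I’ve left out, and the tensor layout is an assumption):

```python
import torch
import torch.nn.functional as F

def masked_policy_loss(logits, actions, advantages, ship_mask):
    """Policy loss over a (batch, H, W, n_actions) grid of logits.

    `ship_mask` is (batch, H, W) and True only where a friendly ship
    stands; every other cell contributes nothing to the gradient.
    """
    log_probs = F.log_softmax(logits, dim=-1)
    taken = log_probs.gather(-1, actions.unsqueeze(-1)).squeeze(-1)
    per_cell = -(taken * advantages)  # vanilla policy-gradient term
    mask = ship_mask.float()
    return (per_cell * mask).sum() / mask.sum().clamp(min=1)
```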
You can likely add some ML parts to your bot to make it better, but you still need to find good actions that you want it to take.
Reinforcement learning generally works by taking actions according to the current policy (and sometimes taking a random action instead), and then trying to estimate whether each action was good or bad in the long run. In most cases it slowly, gradually improves the current policy.
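As a bare-bones illustration of that loop (generic Python, not Halite-specific; `env`, `policy` and `update` are stand-ins):

```python
import random

def train(env, policy, update, episodes=1000, epsilon=0.1):
    """Skeleton of the generic RL loop described above."""
    for _ in range(episodes):
        state, trajectory = env.reset(), []
        done = False
        while not done:
            # Mostly follow the current policy, occasionally explore.
            if random.random() < epsilon:
                action = env.sample_action()
            else:
                action = policy(state)
            next_state, reward, done = env.step(action)
            trajectory.append((state, action, reward))
            state = next_state
        # Judge actions by their long-run return and nudge the policy.
        update(policy, trajectory)
```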
There is AlphaZero, which works by generating stronger actions using Monte Carlo Tree Search with the current network as a heuristic, and then training the network to take the actions recommended by the search. This approach has proven extremely good in games with relatively small action spaces like Go or chess, but in Halite the total action space is absurdly big - each ship independently picks one of 6 orders, so a fleet of 50 ships already gives 6^50 joint actions per turn - and some clever reduction would be required to make such a search-based algorithm viable.
One could also try to use replays of the best players and train the network to imitate their actions, but I suspect their strategies are rather sophisticated and might be hard to learn. Neural networks are good at instinctive decisions, not at emulating complicated algorithms.
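Such imitation learning would just be supervised classification over replay states. A minimal sketch in PyTorch, where `replay_loader` yielding `(state, expert_action)` pairs is an assumed interface:

```python
import torch

def behavior_cloning_step(model, optimizer, replay_loader):
    """One epoch of training the network to copy expert moves."""
    loss_fn = torch.nn.CrossEntropyLoss()
    for states, expert_actions in replay_loader:
        logits = model(states)                  # per-state action logits
        loss = loss_fn(logits, expert_actions)  # match the expert's choice
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```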
In most cases you don’t want to retrain the network from scratch - it is slow, and it discards knowledge the network already has that might still be useful.
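In practice that just means loading the last checkpoint and continuing training instead of re-initializing (PyTorch shown as an example; the file name is made up):

```python
import torch

def resume_training(model, path="checkpoint.pt"):
    """Continue from existing weights instead of starting from scratch."""
    model.load_state_dict(torch.load(path))
    model.train()  # back to training mode, then run the usual loop
    return model
```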