Reinforcement Learning



I am trying to gain some ML knowledge through this exercise. I have a bot which plays decently now. Is it possible to use ML to build on the current behaviors of the bot? Or would the ML bot always have to be a separate, newly trained bot?


@anirudhkamineni Hi, it is possible to train on it, either by improving its weights (meaning training them further) or by making a dataset of games played by it and training a model on that dataset to make it slightly better than the one you made before, then iterating through this process again and again. Yes, it should be better if you train a new model with a better dataset. Hope it helps :)


My agents have a return-home action, which automatically makes them follow the cheapest return path. That way they don’t need to calculate it, which would probably be rather hard to learn.

I briefly experimented with DQN, but I wasn’t able to make it work - I suspect this game might be hard for deterministic policies. My PPO agents seem to fall into deadlocks after they become too sure of their decisions and randomness of their actions decreases. It would be awesome to make DQN work though - I really like the principle of off-policy algorithms.

Well, I definitely wasn’t expecting that kind of performance - the differences between my low gold bot and my high platinum bot are, if I recall correctly: building dropoffs (using a very dumb scripted strategy), using a clever trick to make ships more willing to kill each other, using a slightly different network architecture and a more reliable movement solver. The core idea and core code are still the same.

The problem is that I can’t directly improve the strategy I am using, because I have no idea what strategy I am actually using. The network is controlling all ship movements, and I can only guess what it is doing. It clearly understands inspiration, on the other hand it most probably does not exactly understand combat.

About choosing the reward function - for 4 player games my reward is still simply the halite delivered by a ship to a dropoff. For 2 player games the reward is the same, except that when ships collide, they get rewarded using the trick mentioned earlier, which I won’t disclose for now.


Outputs for cells without ships are ignored (and masked when computing the loss).
Is that during the training phase, evaluation, or both?

They are ignored while generating games, because I calculate the policy for all map cells and then only use the cells that have ships on them. They are also ignored during training, so the network is not trained to do something meaningless.
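To illustrate the masking idea, here is a minimal sketch of a loss that only counts cells with ships. It uses a plain policy-gradient loss rather than the PPO objective mentioned in this thread, and the function name, argument names and nested-list layout are all assumptions made for the example.

```python
def masked_policy_loss(log_probs, actions, advantages, ship_mask):
    """Average policy-gradient loss over the cells that contain a ship.

    log_probs:  H x W x A nested lists of log-probabilities per cell.
    actions:    H x W grid of chosen action indices (unused where the mask is 0).
    advantages: H x W grid of advantage estimates.
    ship_mask:  H x W grid, 1 where a ship is present, 0 otherwise.
    All names and shapes here are hypothetical, chosen for illustration.
    """
    total, count = 0.0, 0
    for y, row in enumerate(ship_mask):
        for x, has_ship in enumerate(row):
            if not has_ship:
                continue  # empty cell: contributes nothing to the loss
            total += -log_probs[y][x][actions[y][x]] * advantages[y][x]
            count += 1
    return total / count if count else 0.0
```

Because empty cells are skipped before anything is added to the total, whatever the network outputs there has no effect on the gradient.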

@anirudhkamineni, @Abhishek
You can likely add some ML parts to your bot to make it better, but you still need to find good actions that you want it to take.

Reinforcement learning generally works by taking actions according to the current policy, sometimes taking a random action instead, and then trying to estimate whether each action was good or bad in the long run. In most cases it slowly, gradually improves the current policy.
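The "mostly follow the policy, sometimes act randomly" part can be sketched as an epsilon-greedy choice. This is one common way to add exploration, not necessarily what any bot in this thread does, and the function and parameter names are hypothetical.

```python
import random

def choose_action(policy_scores, epsilon, rng):
    """Pick the policy's preferred action, or a uniformly random one with probability epsilon."""
    if rng.random() < epsilon:
        return rng.randrange(len(policy_scores))  # explore: ignore the policy
    # exploit: index of the highest-scoring action
    return max(range(len(policy_scores)), key=lambda a: policy_scores[a])
```

For example, `choose_action([0.1, 0.7, 0.2], 0.05, random.Random(0))` follows the policy about 95% of the time and tries something random otherwise.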

There is AlphaZero, which works by generating stronger actions using Monte Carlo Tree Search with the current network as a heuristic, and then training the network to take the actions recommended by the search. This approach has proven to be extremely good in games with relatively small action spaces like Go or chess, but in Halite the total action space is absurdly big and some clever reduction would be required to make such a search-based algorithm viable.

One could also try to use replays of the best players and train the network to take the actions they take, but I guess their strategies are rather sophisticated and might be hard to learn. Neural networks are good at instinctive decisions, not at emulating complicated algorithms.

In most cases you don’t want to retrain the network from scratch - it is slow, because it discards knowledge the network already has that might still be useful.


@MichalOp Thanks for the write-up!

I am using PPO as a training algorithm. The network currently uses 64x64x5 inputs, and generates 64x64x1 (value estimate) and 64x64x6 (policy) outputs.

The 6 channels for policy are 4 for directions, 1 for still and 1 for new dropoff point?

I imagine your policy network is essentially:

ship_value_at_state = discounted_sum(ship.halite_picked_up - halite_spent + halite_dropped_off) + value(final_state_board) * discount_factor

The value network is just the discounted remaining halite collected over the rest of the game? How far are you looking ahead?

I have been working on machine learning for predicting user moves but haven’t dug into RL for this game yet. I posted my starter pack on a separate thread but I’m thinking about expanding it for RL.


I’d love an RL starter pack, since starting is where I seem to have gotten stuck! Haha


Maybe a dumb question, but considering that any machine learning model built on a neural network will be 100MB+, I imagine it’s impossible to actually submit an ML bot.


For a fully-connected neural network that might be true. However, Halite is very well-suited for convolutional neural networks. They are a bit hard to explain without the associated jargon (“kernel”, “convolution”, “channel”, “parameter sharing”), but basically, instead of one huge neural network for the entire board, you use a set of much smaller ones that look at a 3x3 or 5x5 neighborhood and are reused for each square. They are much, much smaller than a fully-connected network.
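As a rough illustration of parameter sharing, the sketch below applies one 3x3 kernel to every cell of a grid. The wrap-around edges are an assumption matching how the Halite map is usually described, and the function name and list-of-lists layout are made up for the example.

```python
def conv3x3_wrap(grid, kernel):
    """Apply the same 3x3 kernel to every cell of a 2D grid with wrap-around edges."""
    h, w = len(grid), len(grid[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    # the same 9 weights are reused at every (y, x)
                    acc += kernel[dy + 1][dx + 1] * grid[(y + dy) % h][(x + dx) % w]
            out[y][x] = acc
    return out

kernel_weights = 3 * 3            # 9 weights, whatever the board size
dense_weights = (64 * 64) ** 2    # 16,777,216 weights to connect every cell to every cell
```

The kernel has 9 weights on a 32x32 board and on a 64x64 board alike, while a single dense layer from every cell to every cell of a 64x64 board would already need about 16.8 million.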


I’m familiar with CNNs.

Even those are fairly large. The smallest CNNs I’ve seen for image recognition are the MobileNets, which come in at about 14-16 MB[0]. These contain only three convolutional layers and 1 point-wise convolutional layer at the end [1].

I use convolutional layers in mine, but with a final dense layer at the end, and my models are way too large. I’ll try to use an architecture more similar to MobileNet, but I’m not sure how effective it’ll be for predicting moves or for reinforcement learning.


The 6 channels for policy are 4 for directions, 1 for still and 1 for new dropoff point?

The last one is for returning: when an agent takes it, the ship will move along the cheapest path to the closest dropoff. I am currently using a scripted strategy to build dropoffs.
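A scripted cheapest-path search like this could be sketched with Dijkstra's algorithm. This is only a guess at how such an action might be implemented: the move cost (a tenth of the halite in the cell being left, rounded down), the wrap-around map, and every name below are assumptions, and collisions with other ships are ignored.

```python
import heapq

def cheapest_return_path(halite, start, dropoffs):
    """Dijkstra on a wrap-around grid; returns (total_cost, path) to the cheapest dropoff.

    Assumed cost model: leaving cell (y, x) costs halite[y][x] // 10.
    """
    h, w = len(halite), len(halite[0])
    targets = set(dropoffs)
    best = {start: 0}   # cheapest known cost to reach each cell
    prev = {}           # predecessor of each cell on its cheapest path
    queue = [(0, start)]
    while queue:
        cost, pos = heapq.heappop(queue)
        if cost > best[pos]:
            continue  # stale queue entry
        if pos in targets:
            path = [pos]
            while path[-1] != start:
                path.append(prev[path[-1]])
            return cost, path[::-1]
        y, x = pos
        step = halite[y][x] // 10
        for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            nxt = ((y + dy) % h, (x + dx) % w)
            if cost + step < best.get(nxt, float("inf")):
                best[nxt] = cost + step
                prev[nxt] = pos
                heapq.heappush(queue, (cost + step, nxt))
    return None  # no dropoff reachable (only if dropoffs is empty)
```

The second element of the returned path is the cell to move to this turn, so the network only has to decide when to go home, not how.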

The policy network generates probability distributions, and ships make moves sampled from those distributions. For every cell, the value network tries to predict the discounted sum of rewards for the ship that is currently on that cell (that ship might move in the future, so after the game ends, rewards are discounted for every ship, and the discounted reward sums are then placed on the positions that ship occupied at each point in time).
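The bookkeeping described here could look roughly like the sketch below: compute discounted returns per ship, then write each return onto the cell the ship occupied at that step. The trajectory format, the discount factor and all names are assumptions for illustration.

```python
def discounted_returns(rewards, gamma):
    """Return G_t = r_t + gamma * G_{t+1} for every step of one ship's trajectory."""
    returns, running = [0.0] * len(rewards), 0.0
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def value_targets(trajectories, gamma, num_steps, height, width):
    """Build per-step value-target maps and ship masks from ship trajectories.

    trajectories: list of (start_step, positions, rewards), where positions[i]
    is the (y, x) cell occupied at step start_step + i and rewards[i] is the
    reward received at that step (hypothetical format).
    """
    targets = [[[0.0] * width for _ in range(height)] for _ in range(num_steps)]
    mask = [[[0] * width for _ in range(height)] for _ in range(num_steps)]
    for start_step, positions, rewards in trajectories:
        for i, g in enumerate(discounted_returns(rewards, gamma)):
            y, x = positions[i]
            targets[start_step + i][y][x] = g  # return follows the ship as it moves
            mask[start_step + i][y][x] = 1     # only these cells are trained on
    return targets, mask
```

The mask returned alongside the targets marks which cells held a ship at each step, so cells without ships can be left out of the value loss as well.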

Maybe a dumb question, but considering that any machine learning model built on a neural network will be 100MB+, I imagine it’s impossible to actually submit an ML bot.

My networks are around 700-800 kB in size (10 5x5 convolutional layers with 24 filters each and residual connections + 2 1x1 convolutional layers with 128 filters that emulate fully connected layers) - and making them bigger would make them slower to train, which already takes around 12 hours from scratch. Using networks designed for image recognition to play Halite seems like overkill, especially as you probably can’t use pre-trained networks - real-world patterns most likely are not very meaningful in the world of Halite.
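A back-of-the-envelope parameter count shows why a network like this stays under a megabyte. The exact wiring is a guess: the sketch assumes 5 input channels, biases on every layer, float32 weights, and separate 1x1 heads for the 6 policy outputs and the 1 value output.

```python
def conv_params(kernel, in_ch, out_ch):
    """Weights plus biases of one 2D convolution layer."""
    return kernel * kernel * in_ch * out_ch + out_ch

layers = [conv_params(5, 5, 24)]          # first 5x5 layer on the 5 input channels
layers += [conv_params(5, 24, 24)] * 9    # remaining nine 5x5 layers
layers += [conv_params(1, 24, 128),       # two 1x1 layers acting as per-cell dense layers
           conv_params(1, 128, 128)]
layers += [conv_params(1, 128, 6),        # policy head (assumed layout)
           conv_params(1, 128, 1)]        # value head (assumed layout)

total_params = sum(layers)                # 153455 under these assumptions
size_kb = total_params * 4 / 1000         # float32: about 614 kB
```

Under these assumptions the estimate is roughly 614 kB, in the same range as the quoted 700-800 kB; the gap could come from details not stated here, such as file format overhead or a slightly different head layout. Note that the count does not depend on the 64x64 board size at all.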


@MichalOp Curious to know if you are releasing your code / write-up / post-mortem. Would love to see your work.


I’ve just added a repo at, without a post mortem for now. I will try to add it this week.