Reinforcement Learning


Hey MichalOP, thanks for the info you have provided about implementing RL in Halite. I was curious how you were going about handling the rewards for each step? Particularly, how do you assign a negative reward/done flag if a ship is destroyed? Or are you just using the total halite of the ship as the reward?


Here is a great resource on machine learning concepts; the following topics are explained in the article:

  • What is Machine Learning?
  • What is Machine Learning Language?
  • How does Machine Learning Work?
  • What is Machine Learning used for?
  • What is the Difference between Artificial Intelligence and Machine Learning?
  • Machine Learning vs Deep Learning
  • Types of Machine Learning Algorithms
  • The Best Machine Learning Algorithms
  • Most Used Programming Languages for Machine Learning
  • The Best Open Source Machine Learning Tools
  • Machine Learning Examples
  • Machine Learning Applications
  • What is the future of Machine Learning?
  • How can I learn Machine Learning?


Thanks, I’m stuck around rank 75 with hand-coded rules… with so many combinations of battles and inspiration arrangements, I need to add some ML concepts to advance…

Does anyone have ideas for mapping game features like ships, halite, dropoffs, etc. into numerical values that can go into ML models? Also, what output should I solve for?


Hi. I’m also trying to build a bot with Deep Reinforcement Learning (DeepMind’s DQN). For now, I’m just trying to learn how to collect halite (I use hard-coded rules for ship and dropoff generation and for returning home).

In my case, I use a Neural Network to predict the expected amount of halite collected by a ship for each possible action (n, w, s, e, and o for staying still; i.e., 5 outputs), given a very large state space (input). A typical Deep Reinforcement Learning scenario.

I extract two kinds of features for a given allied ship. First, I create a map centred on the ship.


In the previous map, obstacles are in the red channel, my dropoffs are in the green channel, and halite is in the blue channel.
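A sketch of that map construction in NumPy (all names here are my own, not from the original post): stack the three planes into channels and roll the board so the ship sits in the centre. Since the Halite board wraps around, `np.roll` recentres it correctly.

```python
import numpy as np

def ship_centered_map(obstacles, dropoffs, halite, ship_y, ship_x):
    """Stack the three feature planes (red: obstacles, green: own dropoffs,
    blue: halite) and roll the board so the given ship sits in the centre.
    The board is toroidal in Halite, so np.roll wraps cells correctly."""
    h, w = halite.shape
    state = np.stack([obstacles, dropoffs, halite], axis=-1).astype(np.float32)
    # Shift so that (ship_y, ship_x) lands on the central cell.
    return np.roll(state, shift=(h // 2 - ship_y, w // 2 - ship_x), axis=(0, 1))
```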

Then, I extract different numerical features from the same ship, for example:

  • The distance and the direction vector to the closest dropoff.
  • The distance and the direction vector to the closest ‘k’ enemy and allied ships.
  • The amount of halite in the local area.
  • The amount of halite the ship has.

I normalize all the features to be in the range [0,1].

Finally, I create a NN with two inputs: a Convolutional layer (for the ‘image’ of the map) and a Dense layer for the remaining features. As I said, the NN has 5 outputs, one for each action.
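As a toy illustration of that two-branch idea (layer sizes and all names are my own assumptions): one convolutional branch processes the map “image”, its output is concatenated with the scalar features, and a dense head produces the 5 Q-values. A minimal NumPy forward pass:

```python
import numpy as np

def conv2d_relu(x, kernels):
    """Naive 'valid' convolution + ReLU; x is (H, W, C_in),
    kernels is (kh, kw, C_in, C_out)."""
    kh, kw, _, cout = kernels.shape
    H, W, _ = x.shape
    out = np.zeros((H - kh + 1, W - kw + 1, cout))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Contract the (kh, kw, C_in) patch against each kernel.
            out[i, j] = np.tensordot(x[i:i + kh, j:j + kw], kernels, axes=3)
    return np.maximum(out, 0.0)

def q_values(state_map, extra_features, params):
    """Convolutional branch for the map, then concatenate the scalar
    features and run a dense head with 5 outputs (n, w, s, e, o)."""
    conv_out = conv2d_relu(state_map, params["conv_k"])
    merged = np.concatenate([conv_out.ravel(), extra_features])
    hidden = np.maximum(merged @ params["w1"], 0.0)
    return hidden @ params["w2"]
```

In a real training setup this would of course be a framework model (Keras, PyTorch) so gradients come for free; the sketch only shows how the two inputs merge.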

Unfortunately, I don’t have a GPU, so it is taking a while. I’m training the agent online while playing the game locally (I have to save some state between runs though, e.g., the replay memory).

However, although it looks like the agent is learning, I think that, given the dimension of the state space, it is going to take a looooong time.


Yes. I’m having the same problem with the reward signal. I think it is challenging to design the reward function to teach the agent (ship) efficiently to collect halite AND to return home.


My reward function is the halite collected into a ship per turn, and only a 1% reward for actually returning it home. I don’t have any estimates for battles or inspiration though, so the ones fdiazgon mentioned seem along the right lines. I think I need to find the right multiplier for being at risk of enemy attack and also for possible inspiration.
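That reward shaping could be sketched like this. The 1% deposit bonus follows the post; the collision penalty and its magnitude are my own assumption, echoing the negative-reward-on-death idea raised earlier in the thread.

```python
def step_reward(halite_mined, halite_deposited, destroyed,
                deposit_bonus=0.01, death_penalty=-1000.0):
    """Per-turn reward: halite mined this turn, plus a small bonus for
    halite actually delivered to a dropoff. A destroyed ship gets a flat
    penalty (the value here is an illustrative assumption)."""
    if destroyed:
        return death_penalty
    return halite_mined + deposit_bonus * halite_deposited
```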

Doing it the computer-Go way and playing out to the ultimate win/loss as the reward function is pretty much never going to work without using C++/GPU and simulating 100k playouts, and even then Halite has even more branches than Go.


@MichalOp I see you’ve made good progress on the leaderboard! With some sneaky inspiration management it looks like you could reach the top 10.


@fdiazgon Hi, I guess we can give a negative reward for collisions with other ships and positive rewards for reaching the dropoff and collecting more halite. Hope it helps.


Hi there,
I’m looking at adding some ML on top of my current bot, and am wondering how you implemented:

Outputs for cells without ships are ignored (and masked when computing the loss).

Is that during the training phase, evaluation, or both?



I am trying to gain some ML knowledge through this exercise. I have a bot which plays decently now. Is it possible to use ML to build upon the current behaviors of the bot, or would the ML bot always have to be a separate, newly trained bot?


@anirudhkamineni Hi, it is possible to train on it by either improving its weights (i.e., training them even more), or by making a dataset of games played by it and training a model on top of it to make it slightly better than the one you made before, iterating through this process again and again. And yes, it should be better if you train a new model with a better dataset. Hope it helps :)


My agents have a return-home action, which automatically makes them follow the cheapest return path. That way they don’t need to calculate it themselves, which would probably be rather hard to learn.

I briefly experimented with DQN, but I wasn’t able to make it work - I suspect this game might be hard for deterministic policies. My PPO agents seem to fall into deadlocks after they become too sure of their decisions and randomness of their actions decreases. It would be awesome to make DQN work though - I really like the principle of off-policy algorithms.

Well, I definitely wasn’t expecting that kind of performance - the differences between my low-gold bot and my high-platinum bot are, if I recall correctly: building dropoffs (using a very dumb scripted strategy), using a clever trick to make ships more willing to kill each other, using a slightly different network architecture, and a more reliable movement solver. The core idea and core code are still the same.

The problem is that I can’t directly improve the strategy I am using, because I have no idea what strategy I am actually using. The network is controlling all ship movements, and I can only guess what it is doing. It clearly understands inspiration, on the other hand it most probably does not exactly understand combat.

About choosing the reward function - for 4-player games my reward is still simply the halite delivered by a ship to a dropoff. For 2-player games the reward is the same, except that when ships collide, they get rewarded using the trick mentioned earlier, which I won’t disclose for now.


Outputs for cells without ships are ignored (and masked when computing the loss).
Is that during the training phase, evaluation, or both?

They are ignored while generating games, because I calculate the policy for all map cells and then actually use only the cells on which there are ships. They are also ignored during training, so the network is not trained to do something meaningless.
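A minimal NumPy sketch of that masking (all names assumed): compute a per-cell loss over the whole policy map, then zero out every cell that has no ship before averaging, so empty cells contribute nothing to the gradient.

```python
import numpy as np

def masked_policy_loss(log_probs, advantages, ship_mask):
    """log_probs: (H, W) log-probability of the chosen action per cell;
    advantages: (H, W) advantage estimates; ship_mask: (H, W) bool,
    True only where one of our ships sits. Returns the mean policy-gradient
    loss over occupied cells; unoccupied cells are ignored entirely."""
    per_cell = -(log_probs * advantages)
    n_ships = max(int(ship_mask.sum()), 1)  # avoid division by zero
    return float(np.sum(per_cell * ship_mask) / n_ships)
```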

@anirudhkamineni, @Abhishek
You likely can add some ML parts to your bot to make it better, but you still need to find good actions that you want it to take.

Reinforcement learning generally works by taking actions according to our current policy, sometimes taking a random action instead, and then trying to guess whether each action was good or bad in the long run. In most cases it slowly, gradually improves the current policy.
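The “sometimes make a random action” part is classic epsilon-greedy exploration in value-based methods like DQN (policy-gradient methods such as PPO instead sample actions from the policy distribution). A minimal sketch, with an arbitrary epsilon:

```python
import random

def epsilon_greedy(q_values, epsilon=0.1, rng=random):
    """With probability epsilon pick a uniformly random action,
    otherwise pick the action with the highest estimated value."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```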

There is AlphaZero, which works by generating stronger actions using Monte Carlo Tree Search with the current network as a heuristic, and then training the network to take the actions recommended by the search. This approach has proven to be extremely good in games with relatively small action spaces like Go or chess, but in Halite the total action space is absurdly big, and some clever reduction would be required to make such a search-based algorithm viable.

One could also try to use replays of the best players and train the network to make actions they make, but I guess their strategies are rather sophisticated and might be hard to learn. Neural networks are good at instinctive decisions, not at emulating complicated algorithms.

In most cases you don’t want to retrain the network from scratch - it is slow, as it discards the knowledge that it already has and might be still useful.


@MichalOp Thanks for the write-up!

I am using PPO as the training algorithm. The network currently uses 64x64x5 inputs, and generates 64x64x1 (value estimate) and 64x64x6 (policy) outputs.

The 6 channels for policy are 4 for directions, 1 for still and 1 for new dropoff point?

I imagine your policy network is essentially:

ship_value_at_state = discounted_sum(ship.halite_picked_up - halite_spent + halite_dropped_off) + value(final_state_board) * discount_factor

The value network is just a discounted remaining halite collected over the rest of the game? How far are you looking ahead?
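The discounted sum in the formula above can be computed in a single backward pass; a minimal sketch (the gamma value is illustrative, and `bootstrap` stands in for the value estimate of the final state):

```python
def discounted_returns(rewards, bootstrap=0.0, gamma=0.99):
    """G_t = r_t + gamma * G_{t+1}, seeded with value(final_state),
    matching the discounted_sum(...) + value(final_state_board) form
    quoted above."""
    g = bootstrap
    out = []
    for r in reversed(rewards):  # walk backwards from the last turn
        g = r + gamma * g
        out.append(g)
    return out[::-1]
```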

I have been working on machine learning in predicting user moves but haven’t dug into RL for this game yet. I posted my starter pack on a separate thread but I’m thinking about expanding it for RL.


I’d love an RL starter pack, since starting is where I seem to have gotten stuck! Haha


Maybe a dumb question, but considering that any machine learning model built on a neural network will be 100 MB+, I imagine it’s impossible to actually submit an ML bot.


For a fully-connected neural network that might be true. However, Halite is very well-suited for convolutional neural networks. They are a bit hard to explain without the associated jargon (“kernel”, “convolution”, “channel”, “parameter sharing”), but basically, instead of one huge neural network for the entire board, you use a set of much smaller ones that look at a 3x3 or 5x5 neighborhood and are reused for each square. They are much, much smaller than a fully-connected network.
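To make the size difference concrete, here is a back-of-the-envelope parameter count (the board size and channel counts are illustrative assumptions, not from the post):

```python
def dense_params(n_in, n_out):
    return n_in * n_out + n_out          # weights + biases

def conv_params(k, c_in, c_out):
    return k * k * c_in * c_out + c_out  # one kernel reused at every cell

# A single fully-connected layer from a 64x64x3 board to one output per cell:
fc = dense_params(64 * 64 * 3, 64 * 64)  # roughly 50 million parameters
# A single 5x5 convolutional layer with 24 filters over the same board:
conv = conv_params(5, 3, 24)             # under 2,000 parameters
```

The convolutional layer is reused at every board position, which is where the four-orders-of-magnitude saving comes from.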


I’m familiar with CNNs.

Even those are fairly large. The smallest CNNs I’ve seen for image recognition are the MobileNets, which come in at about 14-16 MB [0]. These contain only three convolutional layers and one point-wise convolutional layer at the end [1].

I use convolutional layers in mine, but with a final dense layer at the end, and my models are way too large. I’ll try to use an architecture more similar to MobileNet, but I’m not sure how effective it’ll be at predicting moves or in a reinforcement learning application.


The 6 channels for policy are 4 for directions, 1 for still and 1 for new dropoff point?

The last one is for returning: when the agent takes it, the ship will move along the cheapest path to the closest dropoff. I am currently using a scripted strategy to build dropoffs.

The policy network generates probability distributions, and ships make moves sampled from those distributions. For every cell, the value network tries to predict the discounted sum of rewards for the ship that is currently on that cell (that ship might move in the future, so after the game ends, each ship’s rewards are discounted and the discounted reward sums are placed on the positions that the ship occupied at each point in time).
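That target construction could be sketched as follows (all names are my own assumptions): compute each ship’s discounted return per turn, then write it back onto the cell that ship occupied at that turn, producing one value-target map per turn.

```python
import numpy as np

def value_targets(ship_tracks, board_shape, gamma=0.99):
    """ship_tracks: list of (positions, rewards) pairs, one per ship,
    where positions[t] is the (y, x) cell the ship occupied at turn t.
    Returns one (H, W) target map per turn: the ship's discounted
    return G_t is placed on the cell it occupied at turn t."""
    n_turns = max(len(pos) for pos, _ in ship_tracks)
    targets = [np.zeros(board_shape) for _ in range(n_turns)]
    for positions, rewards in ship_tracks:
        g, returns = 0.0, []
        for r in reversed(rewards):        # backward discounted sum
            g = r + gamma * g
            returns.append(g)
        returns.reverse()
        for t, ((y, x), g_t) in enumerate(zip(positions, returns)):
            targets[t][y, x] = g_t
    return targets
```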

May be a dumb question but considering that any machine learning model built on a neural network will be 100MB+ I imagine it’s impossible to actually submit an ML bot.

My networks are around 700-800 kB in size (10 5x5 convolutional layers with 24 filters each and residual connections, plus 2 1x1 convolutional layers with 128 filters that emulate fully connected layers) - and making them bigger would make them slower to train, which already takes around 12 hours from scratch. Using networks designed for image recognition to play Halite seems to be overkill, especially as you probably can’t use pre-trained networks - real-world patterns most likely are not very meaningful in the world of Halite.
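As a rough sanity check on that size, one can count parameters for the described architecture (assuming the 64x64x5 input mentioned earlier in the thread, and ignoring output heads and any normalization layers):

```python
def conv_params(k, c_in, c_out):
    return k * k * c_in * c_out + c_out   # kernel weights + biases

total = conv_params(5, 5, 24)             # first 5x5 layer, 5 input planes
total += 9 * conv_params(5, 24, 24)       # nine more residual 5x5 layers
total += conv_params(1, 24, 128)          # first 1x1 "dense-like" layer
total += conv_params(1, 128, 128)         # second 1x1 layer
size_kb = total * 4 / 1024                # float32 weights, in kB
```

This lands around 600 kB, in the same ballpark as the quoted 700-800 kB once the policy and value heads are included.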


@MichalOp curious to know if you are releasing your code/ write-up/ post-mortem, would love to see your work


I’ve just added a repo at, without a post mortem for now. I will try to add it this week.