Reinforcement Learning


I would love to try my hand at an RL bot but I’m having some trouble getting started. Any recommendations on some reading material? Also, I keep getting stuck on the concept of an OpenAI Gym, which many popular libraries seem to use. I’m not sure how I can approach that with a multi-agent bot running on an external game engine. Thoughts?


I saw this in a presentation recently. This isn’t the exact GitHub repository I saw presented.


One idea to deal with running on an external game engine would be to split generating samples and training the network into two different programs.

You could create a bot that loads the current network, plays the game, and saves the observations, actions and rewards to file while playing. Then another program could load them and use them to train the network. Then you would launch a game again to generate the samples using the new network, etc.
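A minimal sketch of that split, assuming episodes are exchanged as compressed NumPy files. The function names, file naming scheme and array layout are hypothetical, not part of any Halite kit: the bot process would call `save_episode` after each game and the trainer would call `load_episodes` before each update.

```python
import numpy as np
from pathlib import Path


def save_episode(directory, episode_id, observations, actions, rewards):
    """Bot side: write one finished game to disk as a single .npz file."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / f"episode_{episode_id:06d}.npz"
    np.savez_compressed(
        path,
        observations=np.asarray(observations, dtype=np.float32),
        actions=np.asarray(actions, dtype=np.int64),
        rewards=np.asarray(rewards, dtype=np.float32),
    )
    return path


def load_episodes(directory):
    """Trainer side: read every saved episode back, oldest first."""
    episodes = []
    for path in sorted(Path(directory).glob("episode_*.npz")):
        with np.load(path) as data:
            episodes.append({key: data[key] for key in data.files})
    return episodes
```

The trainer would then save new weights, delete or archive the used episode files, and the next batch of games would load those weights.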

I am using a slightly different solution, but the one I’ve proposed above is definitely simpler and most probably better than what I have. I have a simple bot that communicates with an external Python program through pipes, and that external program launches several instances of the Halite executable with my bots, which then communicate with the parent program. The theoretical advantage is that you can potentially compute your network for many environments with a single GPU run, and have only one network in memory. The practical disadvantage is that writing the interface is not particularly pleasant, and the interface itself is not necessarily very fast either.

About controlling multiple ships within one game… Who knows? The simplest way would probably be to treat each ship as if it were living in its own environment with its own reward, which is roughly what I am doing. I am using a single CNN to compute values for the whole board in a single run and then use only the fields on which there are ships, but I am tracking every ship and its rewards separately. (In my current ladder version ships are completely selfish and each cares only about how much reward it carries home.)

Edit: Some reading material (also from OpenAI):
I haven’t read very deeply into it, but it most probably covers absolutely everything that can be said about RL. A lot of math though.


Interesting ideas… Thank you!


Hi, nice work explaining your approach. My question is how to separate the ship fields from the CNN feature map. I would also like to know how your RL-based bot is performing so far.


It is currently in low gold. It has two non-RL things: there is a movement solver that guarantees allied ships won’t collide, and the network can decide to automatically return to base using the cheapest path. Ship building is currently also scripted.

I am using PPO as a training algorithm. The network currently uses 64x64x5 inputs, and generates 64x64x1 (value estimate) and 64x64x6 (policy) outputs. If there is a ship on a cell (x,y), it will make a move sampled from the probability distribution policy[x,y], and the value estimate[x,y] will later be used to calculate the advantage for that ship. At the end of the game I calculate values and advantages for all ships, then place those values on 64x64 maps in the right places and use them for training. Outputs for cells without ships are ignored (and masked when computing the loss).
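To illustrate the masking, here is a rough NumPy sketch. It uses a plain policy-gradient loss rather than the PPO clipped objective, and all names and shapes are assumptions made for the example, not the actual bot code. The ship mask is 1 on cells that held a ship and 0 elsewhere, so cells without ships contribute nothing to either loss.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over the last (action) axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def masked_losses(policy_logits, value_pred, actions, advantages, returns, ship_mask):
    """policy_logits: (H, W, A); every other argument: (H, W).

    actions holds the integer action taken on each cell (any value on
    cells without ships); ship_mask is 1.0 where a ship was, else 0.0.
    """
    log_probs = log_softmax(policy_logits)
    # Log-probability of the action that was actually taken on each cell.
    chosen = np.take_along_axis(log_probs, actions[..., None], axis=-1)[..., 0]
    n_ships = max(ship_mask.sum(), 1.0)  # avoid dividing by zero
    policy_loss = -(chosen * advantages * ship_mask).sum() / n_ships
    value_loss = (((value_pred - returns) ** 2) * ship_mask).sum() / n_ships
    return policy_loss, value_loss
```

In this sketch the mask only matters when computing the loss; when playing, the outputs on empty cells are simply never read.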


How successful do you think an RL bot could be? I’m going to give it a shot for the learning opportunity, but expect to perform poorly. Without access to a lot of computational power, this seems very difficult given the timeframe.

Can you give any more tips without compromising your bot’s success?

Congratulations on making it up to #330 by the way! If you haven’t seen it yet, you should look at the stats provided here, there is some neat information. For example, in 2 player games your win% increases with the map size, but crashes for 64x64 maps! Weird!


Thanks for the stats! I wasn’t aware that such a site exists. I will definitely check those 64x64 games. I am currently training on 32x32, 48x48 and 64x64, so it is rather surprising that it fails on games that it is trained on.

How successful do I think an RL bot could be? With proper resources (like maybe 16 or 32 GPUs and enough CPUs to feed them with games) it should get to the first page, I guess. And possibly win, I don’t really know. With my resources (i7 7700k, 1080 Ti), I am hoping to maybe get into the top 100, but that will probably require a different approach than plain PPO. I will most certainly experiment with Q-learning and probably with others. I used PPO kinda because “hey, OpenAI Five was PPO and it worked so it should work here too”, and it sort of does work, but since the action space for every ship is so small, other approaches might be quite a bit faster.

My network is a 14-layer CNN with residual connections and 5x5 kernels with 16 filters on most layers. It might not be enough or it might be complete overkill; I haven’t done any proper hyperparameter search yet. I wanted each ship to be able to see as much of the board as possible, and with 14 layers and 5x5 kernels ships have a theoretical sight radius of 28 fields.
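The sight radius follows from simple arithmetic: each stride-1 convolution with a k x k kernel lets a cell see (k - 1) / 2 fields further in every direction. A tiny sketch, assuming stride 1, odd kernels and no pooling or dilation (the function name is made up for the example):

```python
def receptive_radius(num_layers, kernel_size):
    """Sight radius of a stack of stride-1 convolutions with odd kernels."""
    return num_layers * (kernel_size - 1) // 2


# 14 layers of 5x5 kernels: 14 * 2 = 28 fields in every direction.
print(receptive_radius(14, 5))
```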

I am trying to avoid any sort of reward shaping. In the currently trained version I am trying to slowly include rewards of other ships in each ship’s reward, and subtract rewards of the enemies, but that naturally makes the reward signal absurdly hard for the network to understand.
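One possible way to write such a blend, purely as an illustration: the weights, the use of means and the function name are all assumptions, and the actual formula used in the bot is not given here. The weights would start at 0 (fully selfish ships) and be raised slowly during training.

```python
def mixed_reward(own, allied, enemy, team_weight, enemy_weight):
    """Blend a ship's own reward with allied and enemy ship rewards.

    own: reward of this ship; allied / enemy: lists of the other ships'
    rewards for the same step. Both weights are hypothetical knobs.
    """
    allied_mean = sum(allied) / len(allied) if allied else 0.0
    enemy_mean = sum(enemy) / len(enemy) if enemy else 0.0
    return own + team_weight * allied_mean - enemy_weight * enemy_mean
```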

I don’t have any tips really, as it isn’t very sophisticated yet. Except for using a single network to control multiple agents, it’s a pretty normal PPO implementation. I can only say to not get discouraged and just steal as many solutions from DeepMind, OpenAI and other papers as possible.


Have you tried learning machine learning?


Hi MichalOp, thanks for sharing your great work here!

How did you generalize your model to different map sizes? The naive way is definitely training one bot for each map size, but I think there should be a better way to cope with this problem.

Thanks in advance if you can kindly provide any insights.


In the previous iteration I was padding all boards to 64x64, but that wastes a lot of computing power and makes it harder to loop the board on edges. Currently I am simply running my network with inputs of different sizes. I can do that because I am using a network consisting purely of convolution layers, which don’t care about the spatial dimensions of the input. Keras and TensorFlow with eager execution handle that pretty nicely.
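To show why this works, here is a small NumPy sketch of a single convolution layer with wrap-around padding, written out by hand instead of with Keras. The function name and shapes are assumptions for the example. The same kernel weights apply to a board of any size, and the wrap padding makes the left and right (and top and bottom) edges neighbours.

```python
import numpy as np


def conv2d_wrap(board, kernels):
    """Stride-1 convolution on a board whose edges wrap around.

    board: (H, W, C_in); kernels: (k, k, C_in, C_out) with odd k.
    Returns an array of shape (H, W, C_out).
    """
    k = kernels.shape[0]
    pad = k // 2
    # mode="wrap" copies cells from the opposite edge into the border.
    padded = np.pad(board, ((pad, pad), (pad, pad), (0, 0)), mode="wrap")
    height, width, _ = board.shape
    out = np.zeros((height, width, kernels.shape[-1]))
    for dy in range(k):
        for dx in range(k):
            window = padded[dy:dy + height, dx:dx + width, :]
            out += window @ kernels[dy, dx]  # (H, W, C_in) @ (C_in, C_out)
    return out


rng = np.random.default_rng(0)
weights = rng.normal(size=(5, 5, 5, 16))  # one 5x5 layer, 5 inputs, 16 filters
for size in (32, 48, 64):
    board = rng.normal(size=(size, size, 5))
    print(size, conv2d_wrap(board, weights).shape)
```

Nothing in the weights depends on H or W, which is why a purely convolutional network can be trained on 32x32, 48x48 and 64x64 boards with one set of parameters.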


Hey MichalOp, thanks for the info you have provided about implementing RL in Halite. I was curious how you were going about handling the rewards for each step? Particularly, how do you assign a negative reward/done flag if a ship is destroyed? Or are you just using the total halite of the ship as the reward?


Here is a great resource on Machine Learning Concepts, the following are the topics explained in the article:

  • What is Machine Learning?
  • What is Machine Learning Language?
  • How does Machine Learning Work?
  • What is Machine Learning used for?
  • What is the Difference between Artificial Intelligence and Machine Learning?
  • Machine Learning vs Deep Learning
  • Types of Machine Learning Algorithms
  • The Best Machine Learning Algorithms
  • Most Used Programming Languages for Machine Learning
  • The Best Open Source Machine Learning Tools
  • Machine Learning Examples
  • Machine Learning Applications
  • What is the future of Machine Learning?
  • How can I learn Machine Learning?


Thanks, I’m stuck around rank 75 with hand-coded rules… With so many combinations of battles and inspiration arrangements, I need to add some ML concepts to advance…

Does anyone have ideas on how to map game features like ships, halite, dropoffs, etc. into numerical values that can go into ML models? Also, what output should I solve for?


Hi. I’m also trying to build a bot with Deep Reinforcement Learning (DeepMind’s DQN). For now, I’m just trying to learn how to collect halite (I use hardcoded rules for ship and dropoff generation and for returning home).

In my case, I use a Neural Network to predict the expected amount of halite collected by a ship for each possible action (n, w, s, e, o; i.e., 5 outputs), given a very large state space (input). Typical Deep Reinforcement Learning scenario.

I extract two kinds of features for a given allied ship. First, I create a map centred on the ship.


In the previous map, obstacles are in the red channel, my dropoffs are in the green channel and halite is in the blue channel.

Then, I extract different numerical features from the same ship, for example:

  • The distance and the direction vector to the closest dropoff.
  • The distance and the direction vector to the closest ‘k’ enemy and allied ships.
  • The amount of halite in the local area.
  • The amount of halite the ship has.

I normalize all the features to be in the range [0,1].
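A rough sketch of how these two kinds of features might be computed with NumPy. The constants (1000 as the maximum halite per cell and per ship), the function names and the (row, column) convention are assumptions for illustration, and the board is treated as wrapping at the edges.

```python
import numpy as np

MAX_CELL_HALITE = 1000.0  # assumed normalisation constant
MAX_SHIP_HALITE = 1000.0  # assumed ship capacity


def centered_view(halite, obstacles, dropoffs, ship_pos, radius):
    """(2r+1, 2r+1, 3) image centred on the ship, wrapping at the edges.

    Channel order follows the red/green/blue description above:
    obstacles, own dropoffs, halite. All inputs are (H, W) arrays.
    """
    row, col = ship_pos
    rows = np.arange(row - radius, row + radius + 1) % halite.shape[0]
    cols = np.arange(col - radius, col + radius + 1) % halite.shape[1]
    scaled = np.clip(halite / MAX_CELL_HALITE, 0.0, 1.0)
    stacked = np.stack([obstacles, dropoffs, scaled], axis=-1)
    return stacked[np.ix_(rows, cols)].astype(np.float32)


def wrapped_offset(src, dst, size):
    """Shortest signed offset from src to dst along one wrapping axis."""
    delta = (dst - src) % size
    return delta - size if delta > size // 2 else delta


def scalar_features(ship_pos, ship_halite, dropoff_pos, board_shape):
    """Distance to a dropoff and ship cargo, both scaled to [0, 1]."""
    d_row = wrapped_offset(ship_pos[0], dropoff_pos[0], board_shape[0])
    d_col = wrapped_offset(ship_pos[1], dropoff_pos[1], board_shape[1])
    max_dist = board_shape[0] // 2 + board_shape[1] // 2
    distance = (abs(d_row) + abs(d_col)) / max_dist
    cargo = min(ship_halite / MAX_SHIP_HALITE, 1.0)
    return np.array([distance, cargo], dtype=np.float32)
```

The image would feed the convolutional input and the scalar vector the dense input; direction vectors and the nearest-k ship features could be added to the vector in the same way.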

Finally, I create a NN with two inputs: a Convolutional layer (for the ‘image’ of the map) and a Dense layer for the remaining features. As I said, the NN has 5 outputs, one for each action.

Unfortunately, I don’t have a GPU, so it is taking a while. I’m training the agent online while playing the game locally (I have to save some state between runs though, e.g., the replay memory).

However, although it looks like the agent is learning, I think that, given the dimension of the state space, it is going to take a looooong time.


Yes. I’m having the same problem with the reward signal. I think it is challenging to design the reward function to teach the agent (ship) efficiently to collect halite AND to return home.


My reward function is halite collected into a ship per turn, and only a 1% reward for actually returning it home. I don’t have any estimates for battles or inspiration though, so the ones fdiazgon mentioned seem along the right lines. I think I need to find the right multiplier for being at risk of enemy attack and also for possible inspiration.
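One reading of that reward, written as a small function. The exact definition is not spelled out above, so the interpretation (full credit for halite gained this turn, plus 1% of whatever is deposited) and all names are assumptions.

```python
def ship_reward(cargo_before, cargo_after, deposited, deposit_weight=0.01):
    """Per-turn reward for one ship under the assumed interpretation.

    cargo_before / cargo_after: halite carried before and after the turn.
    deposited: halite dropped off at a shipyard or dropoff this turn.
    """
    collected = max(cargo_after - cargo_before, 0)
    return collected + deposit_weight * deposited
```

Terms for enemy risk or inspiration would be added here with their own multipliers.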

Doing it the computer Go way, playing games out to the ultimate win/loss and using that as the reward, is pretty much never going to work without using C++/GPU and simulating 100k playouts, and even then Halite has even more branches than Go.


@MichalOp I see you’ve made good progress on the leaderboard! With some sneaky inspiration management, it looks like you could reach the top 10.


@fdiazgon Hi, I guess we can give a negative reward for collisions with other ships and positive rewards for reaching the dropoff and collecting more halite. Hope it helps.


Hi there,
I’m looking at adding some ML on top of my current bot, and am wondering how you implemented:

Outputs for cells without ships are ignored (and masked when computing the loss).

Is that during the training phase, evaluation, or both?