QMIX+
- JUNG WOO MOON
- May 6, 2019
Updated: May 13, 2019
Improving upon state-of-the-art decentralized multi-agent reinforcement learning in StarCraft II

Problem Statement and Background
Public interest in deep reinforcement learning has exploded in recent years. AlphaGo's victory over Go champion Lee Sedol in 2016 marked a new milestone for game AIs defeating the world's very best players. More recently, OpenAI Five for the multiplayer online battle arena game Dota 2 and Google DeepMind's AlphaStar for the real-time strategy game StarCraft II have demonstrated AI's ability to learn superhuman, long-term, cooperative multi-agent game strategies. As OpenAI's website puts it, deep learning video game AIs are an initial step toward eventually creating "general-purpose AI systems which start to capture the messiness and continuous nature of the real world, such as teamwork, long time horizons, and hidden information".
Multi-agent reinforcement learning is an interesting and challenging subfield -- unlike single agents that optimize their personal rewards, agents in cooperative settings optimize for the team's reward, which may involve self-sacrifice, coordinated strategies, and so on. Multi-agent systems are useful in any situation involving more than one AI, whether it is a fleet of cars cooperating to move as efficiently as possible or network packet routing.
For our project, we decided to focus on improving the state-of-the-art deep reinforcement learning AI for the game of StarCraft II. Playing entire games of StarCraft II involves collecting resources, constructing and managing buildings and armies of units, launching calculated attacks, and responding to opponent strategies. Since this would be too complex and computationally expensive, we instead focused on a mini-challenge proposed in the StarCraft Multi-Agent Challenge (SMAC) paper (Samvelyan et al., 2019), which targets the micromanagement aspect of battles in StarCraft II. The player fights the built-in game AI, each side with its own team of pre-determined units on a small map. The player must out-"micro"manage the enemy with their own units and defeat all enemy units within a set number of timesteps. Each unit can move and attack independently of the other units, has only partial observability of the environment, and cannot see information about enemy units outside its sensor range. The scenario thus provides a challenging environment for multi-agent reinforcement learning, where the player AI must learn to manage its team of units in order to defeat the enemy. All game observations, available actions, and rewards come from the StarCraft II game environment.
Success in these scenarios is measured by which team wins the battle, i.e. which side still has units standing at the end. Our goal was to improve upon the baseline QMIX's test-time win rate (mean test win rate %) against the built-in AI, both in training speed (measured in timesteps) and in asymptotic win rate.
Approach
Data Preparation
Data was provided by the SMAC environment. For every frame of an episode, the environment provides each agent with a feature vector corresponding to its local observations. During training time, global state information is also available. Due to limited time and resources, we mainly developed on 2 of the 22 available scenarios:
10 Marines vs. 11 Marines (the 10m_vs_11m scenario, asymmetric)
3 Stalkers and 5 Zealots per side (the 3s5z scenario, symmetric)

Baseline Model
The baseline model is QMIX (Rashid et al., 2018), the best-performing model on SMAC. The figure below is a visual representation of the model architecture from the original paper.

Each agent (shown in green) that controls a unit is a Deep Recurrent Q-Network (Hausknecht & Stone, 2015), which represents the agent's individual value function, Q_a. It takes as input o^a_t, the unit's partial observation at timestep t, and u^a_{t-1}, the last action taken by the unit. The recurrent unit carries information from previous timesteps, which helps compensate for partial observability.
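A minimal PyTorch sketch of such an agent network is shown below. This is our own simplification for illustration; the class name, layer sizes, and argument names are our choices rather than the exact PyMARL code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RNNAgent(nn.Module):
    """Sketch of a DRQN-style agent network (illustrative simplification).

    Input per timestep: the agent's partial observation o^a_t concatenated with a
    one-hot encoding of its previous action u^a_{t-1}. The GRU cell carries a hidden
    state across timesteps, and the output head produces one Q_a value per action.
    """
    def __init__(self, input_dim, n_actions, hidden_dim=64):
        super().__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.rnn = nn.GRUCell(hidden_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, n_actions)

    def forward(self, obs_and_last_action, hidden):
        x = F.relu(self.fc1(obs_and_last_action))
        hidden = self.rnn(x, hidden)          # memory of previous timesteps
        return self.fc2(hidden), hidden       # per-action Q_a values, new hidden state
```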
The big picture is that we build a model that estimates the joint-action value function, Q_tot(t, u), through a complex non-linear combination of the individual Q_a functions. Our policy is then the argmax_u Q_tot(t, u), where u is the joint action of all agents. We use a feed-forward network to handle the non-linear combination -- this network is termed the “mixing network”, shown in blue in the figure above.
However, because our agents are decentralized, we have no access to the joint value function during testing. The authors of the paper extract the policy by ensuring the following:
argmax_u Q_tot(t, u) = (argmax_{u^1} Q_1(t, u^1), ..., argmax_{u^n} Q_n(t, u^n))    (1)
If the above is true, then we do not need to evaluate Q values of all joint actions u for all the agents (an exponential search space), since we can instead just take the individual argmax action for each agent and get the same result.
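To make this concrete, here is a minimal sketch of decentralized greedy execution (our own illustrative code, not PyMARL's; the function name and tensor shapes are assumptions): each agent simply argmaxes its own Q-values over the actions the environment currently allows, and under (1) the resulting joint action also maximizes Q_tot.

```python
import torch

def select_greedy_actions(agent_qs, avail_actions):
    """Decentralized greedy execution: each agent argmaxes its own Q-values.

    agent_qs:      (n_agents, n_actions) individual Q_a values at this timestep
    avail_actions: (n_agents, n_actions) 0/1 mask of actions the game allows

    Under the monotonicity condition (1), this per-agent argmax yields the same
    joint action as searching over all n_actions ** n_agents combinations of
    Q_tot, so no centralized search is needed at execution time.
    """
    masked_qs = agent_qs.clone()
    masked_qs[avail_actions == 0] = -float("inf")  # never pick unavailable actions
    return masked_qs.argmax(dim=-1)                # (n_agents,) joint action
```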

The authors argued that (1) holds if the mixing network is a monotonic function of each Q_a, which can be enforced by constraining the weights of the feed-forward network to be non-negative. Additionally, an ideal model would also incorporate information about the global game state. The authors address both points with "hypernetworks" (shown in red): feed-forward networks that take the global state as input and output the weights of the mixing network, passed through an absolute value activation so they are non-negative (the hypernetworks also output the biases added to each layer of the mixing network, as shown in the figure).
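Below is a simplified PyTorch sketch of such a mixer, written for illustration only (layer sizes and names are our own choices rather than the exact PyMARL code): the hypernetworks map the global state to the mixing network's weights and biases, and an absolute value keeps the weights non-negative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QMixer(nn.Module):
    """Simplified sketch of a QMIX-style mixing network.

    The hypernetworks map the global state to the weights/biases of a small
    two-layer mixing network. Taking abs() of the weights keeps them
    non-negative, so Q_tot is monotonic in every agent's Q_a (condition (1)).
    """
    def __init__(self, n_agents, state_dim, embed_dim=32):
        super().__init__()
        self.n_agents, self.embed_dim = n_agents, embed_dim
        # hypernetworks: global state -> mixing-network parameters
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)
        self.hyper_b2 = nn.Sequential(nn.Linear(state_dim, embed_dim),
                                      nn.ReLU(),
                                      nn.Linear(embed_dim, 1))

    def forward(self, agent_qs, state):
        # agent_qs: (batch, n_agents) chosen Q_a values; state: (batch, state_dim)
        bs = agent_qs.size(0)
        w1 = torch.abs(self.hyper_w1(state)).view(bs, self.n_agents, self.embed_dim)
        b1 = self.hyper_b1(state).view(bs, 1, self.embed_dim)
        hidden = F.elu(torch.bmm(agent_qs.view(bs, 1, self.n_agents), w1) + b1)
        w2 = torch.abs(self.hyper_w2(state)).view(bs, self.embed_dim, 1)
        b2 = self.hyper_b2(state).view(bs, 1, 1)
        return (torch.bmm(hidden, w2) + b2).view(bs, 1)  # Q_tot
```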
The baseline model also uses RMSProp with a fixed learning rate of 0.0005, double Q-learning, and an L2 loss. The target Q-network is updated every 200 timesteps. Epsilon is decayed linearly from 1 to 0.05 over 50k timesteps, and gamma is set to 0.99. A replay buffer holds the 5,000 most recent episodes, from which 32 episodes are sampled uniformly at random for each update step.
Our Model
For our model, we borrowed ideas and improvements from the "Rainbow" paper (Hessel et al., 2017). Our initial goal was to implement two main improvements: prioritized experience replay and multi-step learning.

Multi-Step Learning

A Q-value estimate in temporal-difference learning takes the current reward plus the discounted Q-value estimate of the best action at the next state. However, this bootstrapped estimate introduces significant bias, since it does not reflect the real rewards obtained over the rest of the episode. With multi-step learning, we instead sum the discounted rewards up to n steps into the future before applying our Q estimate at that state. Because we incorporate more of the rewards we actually observed, we decrease bias at the cost of more variance, since those rewards depend on the actions chosen by our epsilon-greedy policy. With this tradeoff we are typically able to achieve faster convergence of the Q-value estimates.
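The sketch below shows how such n-step targets can be computed for a single finished episode. It is illustrative only: the actual PyMARL buffer works on padded batches of episodes, and the function and argument names here are our own.

```python
import torch

def n_step_targets(rewards, bootstrap_q, gamma=0.99, n=3):
    """Illustrative n-step TD targets for one finished episode.

    rewards:      (T,) team reward received at each timestep t
    bootstrap_q:  (T,) max_u Q_target(s_{t+1}, u) for each timestep t
    Assumes s_T (the state after the final reward) is terminal, so no
    bootstrapping happens past the episode end.
    Target: y_t = sum_{k=0}^{n-1} gamma^k r_{t+k} + gamma^n max_u Q_target(s_{t+n}, u)
    """
    T = rewards.size(0)
    targets = torch.zeros(T)
    for t in range(T):
        ret = 0.0
        for k in range(min(n, T - t)):           # reward window, truncated at episode end
            ret += (gamma ** k) * rewards[t + k]
        if t + n < T:                            # s_{t+n} is non-terminal: bootstrap from it
            ret += (gamma ** n) * bootstrap_q[t + n - 1]
        targets[t] = ret
    return targets
```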
After adding multi-step learning to our model, we found that significant tuning was needed to choose the appropriate value of n for each scenario. We found that n = 5 worked best for the 10m_vs_11m scenario and n = 3 worked best for the 3s5z scenario.
Prioritized Experience Replay
One of the weaknesses of standard experience replay is that it samples from the buffer uniformly; if we prioritize the more valuable experiences and sample them more frequently, training can improve significantly. The value of an experience is measured by its absolute TD error,
|δ_i| = |r_i + γ · max_{u'} Q(s'_i, u') − Q(s_i, u_i)|,
which roughly represents how much we can learn from it: a large error signifies a big difference between what the current Q function predicts and the estimate implied by the sampled experience. The TD error is converted into a "priority",
p_i = |δ_i| + ε,
where epsilon is a small constant that prevents zero priorities, and then into a proper probability through exponentiation by alpha and normalization, where alpha is a hyperparameter controlling how strong the prioritization is (alpha = 0 recovers uniform sampling, alpha = 1 makes sampling fully proportional to priority):
P(i) = p_i^α / Σ_k p_k^α.
However, priority sampling naturally introduces a bias towards experiences with high priority. This is mitigated by importance-sampling weights
w_i = (1/N · 1/P(i))^β,
which scale each update accordingly. The hyperparameter beta is annealed from its initial value to 1 because "the unbiased nature of the updates is most important near convergence at the end of training" (Schaul et al., 2016).
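Putting the pieces together, a minimal sketch of proportional prioritized sampling over a small buffer might look as follows. This is illustrative NumPy code rather than our PyMARL integration, and the function name and defaults are our own; it scans the whole buffer instead of using a sum-tree, which is discussed next.

```python
import numpy as np

def per_sample(td_errors, batch_size, alpha=0.6, beta=0.4, eps=1e-6):
    """Illustrative proportional prioritized sampling.

    td_errors: (N,) latest TD-error estimate for each stored item
    Returns the sampled indices and their importance-sampling weights.
    """
    priorities = np.abs(td_errors) + eps            # p_i = |delta_i| + epsilon
    probs = priorities ** alpha
    probs /= probs.sum()                            # P(i) = p_i^alpha / sum_k p_k^alpha
    idx = np.random.choice(len(probs), size=batch_size, p=probs)
    weights = (len(probs) * probs[idx]) ** (-beta)  # w_i = (1 / (N * P(i)))^beta
    weights /= weights.max()                        # normalize by max weight, as in Schaul et al.
    return idx, weights
```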
Our implementation of prioritized experience replay was mainly adapted from the OpenAI Baselines code. The authors of the paper also take advantage of a data structure called a "sum-tree" to make sampling from and updating the buffer O(log N); an implementation is included in the OpenAI repository.
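For illustration, a bare-bones sum-tree might look like the sketch below (our own simplification, not the OpenAI Baselines code): every leaf stores one priority and every internal node stores the sum of its children, so both updating a priority and sampling proportionally to priority touch only one root-to-leaf path.

```python
import numpy as np

class SumTree:
    """Minimal sum-tree sketch for proportional sampling in O(log N)."""

    def __init__(self, capacity):
        self.capacity = capacity
        # index 0 unused; root at 1; leaves at capacity .. 2*capacity - 1
        self.tree = np.zeros(2 * capacity)

    def update(self, idx, priority):
        pos = idx + self.capacity                   # leaf position of item idx
        change = priority - self.tree[pos]
        while pos >= 1:                             # propagate the change up to the root
            self.tree[pos] += change
            pos //= 2

    def sample(self):
        mass = np.random.uniform(0, self.tree[1])   # tree[1] holds the total priority
        pos = 1
        while pos < self.capacity:                  # descend until a leaf is reached
            left = 2 * pos
            if mass <= self.tree[left]:
                pos = left
            else:
                mass -= self.tree[left]
                pos = left + 1
        return pos - self.capacity                  # index of the sampled item
```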
Adding prioritized experience replay turned out to be detrimental, performing noticeably worse than the baseline. This was likely caused by an implementation issue, as most papers using prioritized experience replay report improved performance. One complicating detail we encountered is that PyMARL's standard experience replay samples whole episodes rather than individual transitions, which made adapting transition-level prioritization less straightforward.
Other Improvements
We made several other attempts to improve the model, including a learning rate decay schedule (the baseline's learning rate is fixed at 0.0005), the Adam optimizer instead of RMSProp, the Huber loss instead of the L2 loss, and a regularization term. However, none of these seemed to improve upon the baseline. Each change likely requires significant hyperparameter tuning, which was difficult given how long it takes to even begin seeing results on some environments.
We also attempted to modify the hypernetworks. The initial plan was to concatenate the state information with the Q-values instead of using hypernetworks, but we found that introducing so many additional features made it harder for the mixing network to learn. Eventually, we decided to introduce minor changes to the QMIX mixer itself. One change replaced the absolute value function applied to the hypernetwork weights with a quadratic function (0.5x^2). This replacement performed equally to the absolute value on the 10m_vs_11m scenario and dramatically better on 3s5z. We theorize that this is because the weight values tend to be small (< 1), and the quadratic function provides smoother gradients at these values than the absolute value function.
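Concretely, the change amounts to swapping the activation applied to the hypernetwork weight outputs (the torch.abs calls in the mixer sketch above). The helper below is our own illustrative naming, not PyMARL code; both choices keep the weights non-negative, so the monotonicity argument still holds.

```python
import torch

def positive_weights(x, use_quadratic=False):
    """Weight activation for the mixing network (illustrative helper).

    Both options are non-negative, preserving monotonicity of Q_tot in each Q_a;
    they differ in their gradients: d/dx |x| = sign(x), while d/dx 0.5x^2 = x,
    which is smoother and smaller for the typical |x| < 1 hypernetwork outputs.
    """
    return 0.5 * x ** 2 if use_quadratic else torch.abs(x)
```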
Results
10m_vs_11m

Models were trained from scratch directly on the 10m_vs_11m scenario. Every 5,000 training timesteps, 24 test episodes were run in which agents act greedily, and the win rate against the built-in AI was recorded. The graph plots the mean of every 4 such evaluations (20,000 timesteps) and uses their variance to display a 95% confidence interval. The baseline model (QMIX) begins making significant training progress around 2 million timesteps in. Between 2 and 4 million timesteps the test-time win rate fluctuates dramatically, and the model reaches asymptotic performance (>95% win rate) at around 7 million timesteps.
We ran models with multi-step learning as the only change (n = 3 and n = 5). At n = 3, training was not noticeably faster than the baseline, but the variance in test win rate decreased somewhat during the 2-4 million timestep range, suggesting that the policy the agents learn during this period is more consistent against the enemy. A plausible explanation is that using 3-step returns lowers the bias introduced by the target Q-network estimates. With n = 5 we saw tremendous improvements in both training time and consistency of win rates: significant improvement starts at around 1 million timesteps, and asymptotic performance is reached around 5 million timesteps (compared to 7 million for the baseline).
3s5z

Models were again trained from scratch on 3s5z, with the same training and test methodology as in the 10m_vs_11m scenario. The baseline, as in the SMAC paper, has significant difficulty learning to win this scenario, training very slowly up to a maximum test-time win rate of around 40% after 10 million timesteps. In the QMIX paper, the model eventually reaches asymptotic performance at only around an 80% win rate. Using multi-step learning (n = 3) together with the quadratic replacement for the absolute value function in the hypernetworks, we saw tremendous improvements in both learning speed and final performance: the test-time win rate makes large advances within the first million timesteps and exceeds a 95% mean test win rate within 10 million timesteps.
Using multi-step learning alone only yielded performance similar to the baseline; the quadratic hypernetwork replacement was the main driver of the faster learning and better asymptotic performance. Surprisingly, the observed loss values and temporal-difference errors for the baseline were much lower than our model's. This suggests the baseline was overfitting to the reward signal, which early on is dominated by rewards for dealing damage rather than for winning the game. By overfitting, the baseline model learned to chase quick rewards instead of a long-term, game-winning strategy.
Tools
PySC2
This environment allows us to interact with StarCraft II's server and provides the scenarios we used, along with all the state, reward, and action data we need. State information includes the locations and health of our units and of currently observable enemy units. Actions for each unit include moving (up, down, left, right), staying in place, and attacking an enemy unit in range. Rewards are given for winning a game, damaging enemy units, and defeating enemy units.
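As a concrete picture of this interface, here is a minimal interaction loop using the SMAC wrapper around PySC2 that PyMARL builds on, roughly following the example in the SMAC README (a sketch under that assumption; the agents here just pick random available actions rather than using a trained model).

```python
import numpy as np
from smac.env import StarCraft2Env

env = StarCraft2Env(map_name="10m_vs_11m")
info = env.get_env_info()
n_agents = info["n_agents"]

env.reset()
terminated, episode_reward = False, 0.0
while not terminated:
    obs = env.get_obs()                                  # per-agent partial observations
    state = env.get_state()                              # global state (training only)
    actions = []
    for agent_id in range(n_agents):
        avail = env.get_avail_agent_actions(agent_id)    # mask of legal move/attack/no-op actions
        actions.append(np.random.choice(np.nonzero(avail)[0]))
    reward, terminated, _ = env.step(actions)            # shared team reward
    episode_reward += reward
print("episode reward:", episode_reward)
env.close()
```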
PyMARL
The Python Multi-Agent Reinforcement Learning framework built by the Whiteson Research Lab had implementations of QMIX and other models ready to go. The framework also allows us to train, load, and save models, as well as interact with the PySC2 environment. We used PyMARL to test the baseline performance, and to write our own models by editing the existing code.
PyTorch
The QMIX implementation in PyMARL was written with the machine learning library PyTorch, which made it the natural choice to work with.
Google Cloud Platform (GCP)
We used GCP to train our models on the cloud, since running the environment was computationally expensive (~10-18 hours per run). We used 2 Google Cloud Compute Engine instances with 8 CPUs each and 1 Nvidia K80 GPU each.
Lessons Learned
Our model improvements came from multi-step learning and from modifications to the hypernetworks, specifically the quadratic replacement for the absolute value function. Our results show significant improvement on both tested scenarios after some hyperparameter tuning and experimentation.
One of the main lessons we learned is that tracking down implementation issues is difficult. For example, our initial multi-step learning implementation had a bug in how it handled masking, yet the model trained on this buggy implementation was still able to learn, albeit worse than the baseline. The model using prioritized experience replay also ended up performing worse, and it is unclear whether this was due to an implementation issue as well.
Another lesson we learned is that good performance on one scenario does not necessarily translate to another. To gauge early on whether a model would perform well, we tested it on a simple scenario known as 3m (3 Marines per side), where models can achieve a >70% training-time win rate within 100k timesteps. However, models performing close to or slightly better than the baseline on 3m often did not do better on the 10m_vs_11m scenario. Training speed and asymptotic performance are hard to predict across scenarios.
Team Contributions
We only have 2 members in our team. We decided to split credit equally (50/50).
Christopher Qian - Worked on implementing multistep learning and various other improvements for the model. Handled the Google Cloud Computing resources, model training and experimentation, and data gathering. Assisted with the poster and writeup.
Jung Woo Moon - Worked on implementing prioritized experience replay. Created graph visualizations for the results and made most of the poster content. Worked on the final writeup and writeup website.
References
Rashid, T., Samvelyan, M., de Witt, C. S., Farquhar, G., Foerster, J. N., and Whiteson, S. (2018). QMIX: Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning.
Hessel, M., Modayil, J., van Hasselt, H., Schaul, T., Ostrovski, G., Dabney, W., Horgan, D., Piot, B., Azar, M., and Silver, D. (2017). Rainbow: Combining Improvements in Deep Reinforcement Learning.
Samvelyan, M., Rashid, T., de Witt, C. S., Farquhar, G., Nardelli, N., Rudner, T., Hung, C., Torr, P., Foerster, J., and Whiteson, S. (2019). The StarCraft Multi-Agent Challenge.
Hausknecht, M. and Stone, P. (2015). Deep Recurrent Q-Learning for Partially Observable MDPs.
Schaul, T., Quan, J., Antonoglou, I., and Silver, D. (2016). Prioritized Experience Replay.