Games have provided an amazing proving ground for developing strategic AI. The closed nature of games makes it easier to train models and develop solution techniques than in open-ended systems: games are clearly defined, the players are known, and so are the payoffs. One of the earliest and biggest milestones was Deep Blue, the machine that beat the world champion in chess.
Early Milestones: Deep Blue
Deep Blue was a chess-playing supercomputer developed by IBM in the 1990s. As stated in the prologue, it made history in May 1997 by defeating the reigning world chess champion, Garry Kasparov, in a six-game match. Deep Blue utilized specialized hardware and algorithms capable of evaluating 200 million chess positions per second. It combined brute-force search techniques with heuristic evaluation functions, enabling it to search deeper into potential move sequences than any previous system. What made Deep Blue special was its ability to process vast numbers of positions quickly, effectively handling the combinatorial complexity of chess and marking a significant milestone in artificial intelligence.
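To make the brute-force idea concrete, here is a minimal, hypothetical sketch of a depth-limited minimax search guided by a heuristic evaluation function, the general family of techniques Deep Blue built on. Its actual system ran on specialized hardware with alpha-beta pruning and far more sophisticated evaluation; `evaluate`, `legal_moves`, and `apply_move` below are assumed placeholders.

```python
def minimax(state, depth, maximizing, evaluate, legal_moves, apply_move):
    """Depth-limited minimax: brute-force search plus a heuristic evaluation.

    `evaluate(state)`, `legal_moves(state)`, and `apply_move(state, move)` are
    assumed placeholders for a chess engine's position scoring, move
    generation, and move application.
    """
    moves = legal_moves(state)
    if depth == 0 or not moves:
        return evaluate(state), None  # heuristic value of the leaf position

    best_move = None
    if maximizing:
        best_value = float("-inf")
        for move in moves:
            value, _ = minimax(apply_move(state, move), depth - 1, False,
                               evaluate, legal_moves, apply_move)
            if value > best_value:
                best_value, best_move = value, move
    else:
        best_value = float("inf")
        for move in moves:
            value, _ = minimax(apply_move(state, move), depth - 1, True,
                               evaluate, legal_moves, apply_move)
            if value < best_value:
                best_value, best_move = value, move
    return best_value, best_move
```

The deeper the search and the faster the hardware, the further ahead the machine can look, which is exactly where Deep Blue's 200 million positions per second paid off.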
However, as Garry Kasparov notes in his interview with Lex Fridman¹, Deep Blue was more of a brute-force machine than anything else, so it is perhaps hard to qualify it as any type of intelligence. The core of its search is essentially trial and error. And speaking of errors, it made significantly fewer errors than humans, and according to Kasparov this was one of the features that made it so hard to beat.
Advancements in Complex Games: AlphaGo
19 years after Deep Blue's victory in chess, a team from Google DeepMind produced another model that would contribute to a special moment in the history of AI. In 2016, AlphaGo became the first AI model to defeat a world champion Go player, Lee Sedol.
Go is a very old board game with origins in Asia, known for its deep complexity and vast number of possible positions, far exceeding those in chess. AlphaGo combined deep neural networks with Monte Carlo tree search, allowing it to evaluate positions and plan moves effectively. The more time AlphaGo was given at inference, the better it performed.
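The sketch below shows a minimal Monte Carlo tree search loop that uses a learned value estimate to score positions. It is an illustrative skeleton rather than AlphaGo's implementation: `legal_moves`, `apply_move`, and `value_estimate` are assumed placeholders, and the real system also used a policy network to bias which moves get explored.

```python
import math

def mcts_choose_move(root_state, legal_moves, apply_move, value_estimate,
                     num_simulations=800, c_uct=1.4):
    """Minimal UCT-style Monte Carlo tree search.

    `value_estimate(state)` is assumed to return a score from the perspective
    of the player to move in that position, and states are assumed hashable.
    """
    stats = {}  # (state, move) -> (visit_count, total_value)

    def ucb(state, move, parent_visits):
        visits, total = stats.get((state, move), (0, 0.0))
        if visits == 0:
            return float("inf")  # always try unexplored moves first
        return total / visits + c_uct * math.sqrt(math.log(parent_visits) / visits)

    for _ in range(num_simulations):
        state, path = root_state, []
        # Selection / expansion: walk down the tree along high-UCB moves and
        # stop once we play a move that has never been tried before.
        for _ in range(200):  # safety cap on search depth
            moves = legal_moves(state)
            if not moves:
                break  # terminal position
            parent_visits = sum(stats.get((state, m), (0, 0.0))[0] for m in moves) + 1
            move = max(moves, key=lambda m: ucb(state, m, parent_visits))
            unvisited = stats.get((state, move), (0, 0.0))[0] == 0
            path.append((state, move))
            state = apply_move(state, move)
            if unvisited:
                break
        # Evaluation: score the reached position (AlphaGo combined a value
        # network with rollouts here).
        value = value_estimate(state)
        # Backup: credit every move on the path, flipping the sign for the
        # alternating players (negamax convention).
        for depth, (s, m) in enumerate(reversed(path)):
            visits, total = stats.get((s, m), (0, 0.0))
            signed = -value if depth % 2 == 0 else value
            stats[(s, m)] = (visits + 1, total + signed)

    # Finally, play the most visited move at the root.
    return max(legal_moves(root_state),
               key=lambda m: stats.get((root_state, m), (0, 0.0))[0])
```

Because each additional simulation refines the statistics in the tree, giving the search more time at inference directly translates into stronger play.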
The AI trained on a dataset of human expert games and improved further through self-play. What made AlphaGo special was its ability to handle the complexity of Go, utilizing advanced machine learning techniques to achieve superhuman performance in a domain previously thought to be resistant to AI mastery.
One could argue AlphaGo exhibits more intelligence than Deep Blue, given its exceptional ability to deeply evaluate board states and select moves. Move 37 from its 2016 match against Lee Sedol is a classic example. For those acquainted with Go, it was a shoulder hit on the fifth line that initially baffled commentators, including Lee Sedol himself. But as would later become clear, the move was a brilliant play and showcased how AlphaGo would explore strategies that human players might overlook or dismiss.
Combining Chess and Go: AlphaZero
One year later, Google DeepMind made headlines again. This time, they took many of the lessons from AlphaGo and created AlphaZero, a more general-purpose AI system that mastered chess as well as Go and shogi. The researchers were able to build the AI solely through self-play and reinforcement learning, without prior human knowledge or data beyond the rules of each game. Unlike traditional chess engines that rely on handcrafted evaluation functions and extensive opening libraries, AlphaZero used deep neural networks and a novel algorithm combining Monte Carlo tree search with self-play learning.
The system started with only the basic rules and learned optimal strategies by playing millions of games against itself. What made AlphaZero special was its ability to discover creative and efficient strategies, showcasing a new paradigm in AI that leverages self-learning over human-engineered knowledge.
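The overall training loop can be outlined roughly as follows. This is a simplified, hypothetical sketch of AlphaZero-style self-play, not DeepMind's code; every argument (`mcts_policy`, `train_network`, and so on) is an assumed placeholder.

```python
import random

def self_play_training_loop(initial_state, apply_move, is_terminal, outcome,
                            mcts_policy, train_network,
                            num_iterations=10, games_per_iteration=100):
    """Simplified AlphaZero-style self-play loop.

    `mcts_policy(state)` returns a dict mapping moves to probabilities
    (produced by a network-guided tree search), `outcome(state)` scores a
    finished game (e.g. +1, 0, -1), and `train_network(examples)` updates the
    network on (state, search policy, result) triples. `apply_move` is
    assumed to return a new state.
    """
    for _ in range(num_iterations):
        examples = []
        for _ in range(games_per_iteration):
            state, history = initial_state, []
            # Play one full game against itself, recording the search policies.
            while not is_terminal(state):
                policy = mcts_policy(state)
                history.append((state, policy))
                moves, probs = zip(*policy.items())
                state = apply_move(state, random.choices(moves, probs)[0])
            result = outcome(state)
            # Label every recorded position with the final result of the game.
            examples.extend((s, p, result) for s, p in history)
        # Training on its own games improves the network, and the improved
        # network guides the next round of tree search, so the agent keeps
        # getting stronger without any human data.
        train_network(examples)
```

The key design choice is that the only external input is the rule set; everything else, from opening preferences to endgame technique, emerges from the loop itself.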
Integrating Speed and Strategy: StarCraft II
Continuing its domination in the AI space, the Google DeepMind team shifted its focus to a highly popular computer game, StarCraft II. In 2019, they developed an AI called AlphaStar², which achieved Grandmaster-level play and ranked higher than 99.8% of human players on the competitive leaderboard.
StarCraft II is a real-time strategy game that provided several novel challenges for the team at DeepMind. The goal of the game is to defeat the opposing player or players by gathering resources, constructing buildings, and amassing armies. The main challenges in this game arise from the enormous action space that needs to be considered, real-time decision making, partial observability due to the fog of war, and the need for long-term strategic planning, as some games can last for hours.
By building on some of the techniques developed for previous AIs, like reinforcement learning through self-play and deep neural networks, the team was able to create a unique training setup. First, they trained a neural network using supervised learning on human play. Then, they used that network to seed agents that could play against each other in a multi-agent framework. The DeepMind team created a virtual league where the agents could explore strategies against one another and where dominant strategies would be rewarded. Ultimately, they combined the strategies from the league into a super strategy that could be effective against many different opponents and playing styles. In their own words³:
The final AlphaStar agent consists of the components of the Nash distribution of the league — in other words, the most effective mixture of strategies that have been discovered — that run on a single desktop GPU.
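One way to picture that final step is as solving a small meta-game over the league: estimate how each agent scores against every other agent, then compute the equilibrium mixture of that payoff matrix. The sketch below does this with fictitious play on a made-up 3-agent league; it is an illustration of the idea, not AlphaStar's actual procedure, and the payoff numbers are invented for the example.

```python
import numpy as np

def approximate_nash_mixture(payoff, iterations=10_000):
    """Approximate the equilibrium mixture of a zero-sum meta-game by
    fictitious play. `payoff[i, j]` is agent i's average score against agent j,
    assumed here to be estimated from league games."""
    n = payoff.shape[0]
    best_response_counts = np.ones(n)
    for _ in range(iterations):
        mixture = best_response_counts / best_response_counts.sum()
        expected = payoff @ mixture               # each agent's value vs the mixture
        best_response_counts[np.argmax(expected)] += 1
    return best_response_counts / best_response_counts.sum()

# Hypothetical 3-agent league with a rock-paper-scissors-like cycle:
# agent 0 beats 1, 1 beats 2, and 2 beats 0, so no single agent dominates.
league_payoff = np.array([[ 0.0,  0.2, -0.1],
                          [-0.2,  0.0,  0.1],
                          [ 0.1, -0.1,  0.0]])
print(approximate_nash_mixture(league_payoff))  # mixture weights summing to 1
```

The point of the cycle in the example is that no single league member is best against everyone, which is exactly why the final agent is built from a mixture of strategies rather than a single champion.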
Deep Dive into Pluribus and Poker
I love playing poker, and when I was living and studying in Trondheim, we used to have a weekly cash game that could get quite intense! One of the last milestones to be reached by strategic AI was in the game of poker, specifically in one of its most popular forms, 6-player no-limit Texas hold'em. In this game we use a standard 52-card deck, and play proceeds through the following rounds:
- The Preflop: Each player is dealt 2 cards (hole cards) that only they themselves know the value of.
- The Flop: 3 cards are dealt face up so that all players can see them.
- The Turn: A 4th card is dealt face up.
- The River: A final 5th card is dealt face up.
The players can use the cards on the table and the two cards in their hand to assemble a 5-card poker hand. In each round of the game, the players take turns placing bets, and the hand can end in any of the rounds if one player places a bet that no one else is willing to call.
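For readers who prefer code to prose, the street structure above can be illustrated with a toy dealing routine. Betting, blinds, and hand ranking are left out, and the card encoding is just an arbitrary choice for this sketch.

```python
import random

RANKS = "23456789TJQKA"
SUITS = "shdc"  # spades, hearts, diamonds, clubs

def deal_hand(num_players=6, seed=None):
    """Deal one hand of 6-player Texas hold'em to show the street structure
    (betting, blinds, and hand evaluation are intentionally omitted)."""
    rng = random.Random(seed)
    deck = [rank + suit for rank in RANKS for suit in SUITS]  # 52 cards
    rng.shuffle(deck)

    hole_cards = [[deck.pop(), deck.pop()] for _ in range(num_players)]  # preflop
    flop = [deck.pop() for _ in range(3)]  # flop: three shared cards
    turn = deck.pop()                      # turn: fourth shared card
    river = deck.pop()                     # river: fifth shared card
    return hole_cards, flop + [turn, river]

hole_cards, board = deal_hand(seed=7)
print("Player 1 hole cards:", hole_cards[0])
print("Board:", board)
```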
Though reasonably simple to learn (one only needs to know the hierarchy of the various poker hands), this game proved very difficult to solve with AI, despite ongoing efforts over several decades.
There are multiple factors contributing to the difficulty of solving poker. Firstly, there is hidden information, because you don't know which cards the other players have. Secondly, there is the multiplayer setup, with each extra player increasing the number of possible interactions and strategies exponentially. Thirdly, the no-limit betting rules allow for a complex betting structure where a player can suddenly decide to bet their entire stack. Fourth, the game tree is enormous due to the combinations of hole cards, community cards, and betting sequences. On top of all this, there is the stochastic nature of the cards, the potential for bluffing, and the need for opponent modelling!
It was only in 2019 that two researchers, Noam Brown and Tuomas Sandholm, finally cracked the code. In a paper published in Science, they describe a novel poker AI, Pluribus, that managed to beat the best players in the world at 6-player no-limit Texas hold'em.⁴ They conducted two different experiments, each consisting of 10,000 poker hands, and both experiments clearly showed the dominance of Pluribus.
In the first experiment, Pluribus played against 5 human opponents, achieving an average win rate of 48 mbb/game with a standard error of 25 mbb/game. (mbb/game stands for milli big blinds per game: thousandths of a big blind won per hand played, so 48 mbb/game corresponds to winning 48 big blinds per 1,000 hands.) A win rate of 48 mbb/game is considered very high, especially among elite poker players, and implies that Pluribus is stronger than its human opponents.
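To put that number in perspective, here is the arithmetic behind the unit, using a hypothetical $100 big blind purely for illustration.

```python
# Converting Pluribus's win rate into more familiar units.
win_rate_mbb = 48        # milli big blinds won per hand (the reported figure)
big_blind = 100          # hypothetical $100 big blind, for illustration only
hands_played = 10_000    # hands in the first experiment

bb_per_100_hands = win_rate_mbb / 1000 * 100       # 4.8 big blinds per 100 hands
implied_profit = win_rate_mbb / 1000 * big_blind * hands_played
print(bb_per_100_hands, implied_profit)            # 4.8  48000.0
```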
In the second experiment, the researchers had 5 copies of Pluribus play against 1 human. They set up the experiment so that 2 different humans would each play 5,000 hands against the 5 machines. Pluribus ended up beating the humans by an average of 32 mbb/game with a standard error of 15 mbb/game, again showing its strategic superiority.
The dominance of Pluribus is quite amazing, especially given all the complexities the researchers had to overcome. Brown and Sandholm came up with several smart strategies that helped Pluribus to become superhuman and computationally much more efficient than previous top poker AIs. Some of their techniques include:
- The use of two different algorithms for evaluating moves. They would first use a so-called "blueprint strategy", created by having the program play against itself using a method called Monte Carlo counterfactual regret minimization. This blueprint strategy is used in the first round of betting, but in subsequent betting rounds Pluribus conducts a real-time search to find a better, more granular strategy. (A minimal sketch of the regret-matching update at the heart of this method is shown after this list.)
- To make its real-time search more computationally efficient, they used a depth-limited search: rather than considering every way the rest of the hand could play out, Pluribus would look only a couple of moves ahead and then assume each opponent plays one of just four continuation strategies: the original blueprint strategy, a blueprint strategy biased towards folding, one biased towards calling, and one biased towards raising.
- They also used various abstraction techniques to reduce the number of possible game states. For example, because a 9-high straight is fundamentally similar to an 8-high straight, the two can be treated in the same way.
- Pluribus would discretize the continuous betting space into a limited set of buckets, making it easier to consider and evaluate various betting sizes.
- In addition, Pluribus balances its strategy in such a way that, for any given hand it is playing, it also considers the other possible hands it could have in that situation and evaluates how it would play those, so that its final play is balanced and thus harder to counter.
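The first technique above rests on counterfactual regret minimization, whose core update is regret matching: keep a running tally of how much you regret not having taken each action, and play actions in proportion to their positive regret. The sketch below shows that update at a single decision point; it is a minimal illustration, not Pluribus's implementation, and `action_values` is an assumed placeholder for the counterfactual values that a full game-tree traversal (or Monte Carlo sampling, in Pluribus's case) would provide.

```python
import numpy as np

def regret_matching_strategy(regret_sum):
    """Regret matching: play each action in proportion to its positive regret,
    falling back to uniform play when no action has positive regret."""
    positive = np.maximum(regret_sum, 0.0)
    total = positive.sum()
    if total > 0:
        return positive / total
    return np.full(len(regret_sum), 1.0 / len(regret_sum))

def train_single_decision(action_values, num_actions, iterations=10_000):
    """Minimal regret-minimization loop at a single decision point.

    `action_values()` is an assumed placeholder returning the payoff of each
    action on this iteration."""
    regret_sum = np.zeros(num_actions)
    strategy_sum = np.zeros(num_actions)
    for _ in range(iterations):
        strategy = regret_matching_strategy(regret_sum)
        values = np.asarray(action_values(), dtype=float)
        expected = strategy @ values
        regret_sum += values - expected   # how much better each action would have done
        strategy_sum += strategy          # the *average* strategy converges toward equilibrium
    return strategy_sum / strategy_sum.sum()
```

Running this kind of update across every decision point in an abstracted game tree, over many self-play iterations, is what produces the blueprint strategy described in the first bullet.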
There are quite a few interesting observations to draw from Pluribus, but perhaps the most interesting is that it doesn't vary its play against different opponents; instead, it has developed a robust strategy that is effective against a wide variety of players. Since many poker players think they have to adjust their play to different situations and people, Pluribus shows us that this is not needed, and perhaps not even optimal, given how it beat all the humans it played against.
In our short foray into game theory, we noted that if you play the NE strategy in two-player zero-sum games, you are guaranteed not to lose in expectation. However, for a multiplayer game like 6-player poker there is no such guarantee. Noam Brown speculates⁵ that it is perhaps the adversarial nature of a game like poker that still makes it suitable to approach with a NE strategy. Conversely, in a game like Risk, where players can cooperate more, pursuing a NE strategy is not guaranteed to work: if you are playing a Risk game with 6 people, there is nothing you can do if your 5 opponents decide to gang up on you and take you out.
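For reference, the two-player guarantee can be stated compactly. In a two-player zero-sum game with value v, an equilibrium strategy for player 1 is also a maximin strategy, so it secures at least v against any opponent play:

```latex
u_1(\sigma_1^*, \sigma_2) \;\ge\; v \;=\; \max_{\sigma_1} \min_{\sigma_2} u_1(\sigma_1, \sigma_2)
\qquad \text{for all opponent strategies } \sigma_2 .
```

In a symmetric game the value is 0, so the equilibrium strategy cannot lose in expectation. With three or more players there is no analogous single-strategy guarantee, which is what makes the multiplayer poker result so notable.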
Evaluating the Trend in Strategic AI
Summarizing the history of strategic AI in games, we see a clear trend emerging. The games are slowly but surely becoming closer to the real-world strategic situations that humans find themselves in on an everyday basis.
Firstly, we are moving from two-player to multiplayer settings, as can be seen in the progression from the initial successes in two-player games to multiplayer games like 6-player poker. Secondly, we are seeing increasing mastery of games with hidden information. Thirdly, we are also seeing increasing mastery of games with more stochastic elements.
Hidden information, multiplayer settings and stochastic events are the norm rather than the exception in strategic interactions among humans, so mastering these complexities is key in achieving a more general superhuman strategic AI that can navigate in the real world.