AI for Tic-Tac-Toe (Part 3)

Last time, I added several challengers to play Tic-Tac-Toe (link). These bots are able to detect and create traps. The last bot uses a tree of the different cases, so it knows the bad choices and avoids them. All of the presented challengers are non-intelligent bots. Before I describe neural networks, there is one last-minute challenger. It is also the first intelligent bot, because it can learn from its mistakes.

6th competitor: 'The Bandit'

'The Bandit' is based on the multi-armed bandit problem.

A bandit is a slot machine. The player inserts coins into the machine, pushes a button and, with luck, can win a lot of money. But when the player goes to the casino, he has the choice between many machines. Which machine gives the best probability of winning?

The best approach is to observe the behavior of the machines and to attribute a score to each one. If the player wins with a machine, its score is increased. If the player loses, its score is decreased. Before each game, the player chooses the bandit with the best score.
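To make the idea concrete, here is a minimal Python sketch of the slot-machine case, assuming one score per machine; the machine names and the gain/loss values are purely illustrative.

```python
import random

# Illustrative scores, one per slot machine (hypothetical names and values).
scores = {"machine_A": 0, "machine_B": 0, "machine_C": 0}

def pick_machine():
    # The player chooses the machine with the best score so far.
    best = max(scores.values())
    candidates = [name for name, s in scores.items() if s == best]
    return random.choice(candidates)  # break ties at random

def update(machine, won, gain=1, loss=1):
    # Increase the score after a win, decrease it after a loss.
    scores[machine] += gain if won else -loss
```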

In the case of the Tic-Tac-Toe challenger, 'The Bandit' doesn't know the criteria for winning. So it tests and learns. If the bandit loses, that means every choice of the game was bad and will be avoided. For each choice, 'The Brute' uses a tag to know the best path, whereas 'The Bandit' prefers to attribute a score.

The algorithm of 'The Brute' is more complex than the algorithm of 'The Bandit' because the programmer needs to define the win conditions and find a strategy to provide this information. 'The Bandit' only needs a tree of choices with scores, and it selects the most valuable path.

The score of each choice is increased or decreased depending on the result of a game. There are different strategies to determine the gain and the loss: is winning more important than not losing?
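As an illustration, here is a small Python sketch of such a score-based challenger, assuming the scores are stored per (board, move) pair and updated once the game is finished; the representation and the default gain/loss values are assumptions, not the exact implementation of 'The Bandit'.

```python
import random

# Hypothetical sketch: one score per (board state, move) pair.
# 'board' is any hashable representation of the 3x3 grid.
scores = {}

def choose_move(board, free_cells):
    # Pick the move with the highest score for this board (ties at random).
    best = max(scores.get((board, m), 0.0) for m in free_cells)
    candidates = [m for m in free_cells if scores.get((board, m), 0.0) == best]
    return random.choice(candidates)

def learn(history, result, gain=1.0, loss=1.0):
    # 'history' is the list of (board, move) pairs played during one game.
    # Every choice of a lost game is punished; every choice of a won game
    # is rewarded. The gain/loss ratio encodes whether winning matters
    # more than not losing.
    delta = gain if result == "win" else -loss if result == "loss" else 0.0
    for board, move in history:
        scores[(board, move)] = scores.get((board, move), 0.0) + delta
```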

Battle #15: 'The Bandit' vs 'Full Random'

The next chart shows the results of battles after the learning period. The X-axis is the ratio between the value of gains and losses. At the left of the chart, the bot receives no gain and only losses modify the score, so the bot wants to avoid losing and prefers a tie. At the right of the chart, the bot receives only gains, so it wants to win and reduces the tie cases.

Strangely, 'Full Random' is the most dangerous opponent of the bandit, while a 'fifty-fifty' strategy is sufficient against the other veterans. This is due to the random behavior of the opponent: even if 'Full Random' has the possibility to win, it may choose another cell and leave the victory to 'The Bandit'. Nevertheless, if the bot learns with another opponent, it has less difficulty fighting 'Full Random'.

Battle #16: 'The Bandit' vs 'Full Random' (2nd round)

Here, 'Partial Random' is the master of 'The Bandit': it has taught it the basics of Tic-Tac-Toe. 'The Bandit' has no difficulty learning, and it loses fewer than 400 times against 'Partial Random'. The only problem with the learning concerns the case where there is no loss: if 'The Bandit' forgets that it can lose, then it has a higher probability of losing.

Now, 'The Bandit' is more efficient against 'Full Random'. I have also tested other masters. 'The Trapper' isn't a good teacher; the results are similar to those of battle #15. Maybe it is too specialized in trap cases. The teaching of 'The Brute' reaches the same level as that of 'Partial Random'. I think 'The Bandit' just needs to learn the rules of the game.

Battle #17: 'The Bandit V2' vs 'Full Random' (3rd round)

I have designed a second version of 'The Bandit' because I found several problems in the algorithm.

First, a fixed value (gain or loss) was added to the weight, which could become very high, so I placed limits in the previous version. I wanted to test another solution for the weight update, which consists of using variable gains/losses: the new weights tend toward the limits without ever crossing them. I think this modification has little impact, but it keeps an order among the weights. With fixed gains/losses and hard limits, two paths may end up with the same value even though they lead to a loss with different probabilities. So now, the gain/loss corresponds to a percentage of the distance between the current weight and a limit.

Secondly, it is possible to modify the weight when the result is a tie. But if the weight becomes negative due to ties, it may become lower than that of a lost match; the chosen path may then change, which creates instability in the results. Now, the weight of a tied path tends toward 0 instead of the negative limit.
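A possible shape for this V2 update, assuming weights bounded in [-LIMIT, +LIMIT] and a rate expressed as a fraction of the remaining distance (both values are hypothetical), could look like this:

```python
# Sketch of the V2 update: the weight moves by a percentage of the
# distance toward its target, so it never crosses the limits.
LIMIT = 1.0
RATE = 0.1  # hypothetical learning rate

def update_weight(weight, result):
    if result == "win":
        # Move a fraction of the way toward the upper limit.
        return weight + RATE * (LIMIT - weight)
    if result == "loss":
        # Move a fraction of the way toward the lower limit.
        return weight + RATE * (-LIMIT - weight)
    # Tie: the weight tends toward 0 instead of the negative limit.
    return weight + RATE * (0.0 - weight)
```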

Thirdly, the learning is long, and the game contains a lot of similar cases related by a transformation (symmetry or rotation). Now, each time a weight is modified, the weights of the similar gameboards are modified too, which improves the learning speed. This choice is due to the next competitor, which is very special: it is not possible to prepare battles with thousands of games.
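For example, the eight equivalent gameboards of a 3x3 grid can be generated with rotations and mirrors; the list-of-9-cells representation below is an assumption, and the chosen move would of course have to be transformed in the same way before its weight is updated.

```python
def rotate(board):
    # 90-degree clockwise rotation of a 3x3 board stored as a list of 9 cells.
    return [board[6], board[3], board[0],
            board[7], board[4], board[1],
            board[8], board[5], board[2]]

def mirror(board):
    # Horizontal mirror (swap the left and right columns).
    return [board[2], board[1], board[0],
            board[5], board[4], board[3],
            board[8], board[7], board[6]]

def equivalent_boards(board):
    # The 8 boards obtained by rotations and mirrors; a weight update
    # on one of them can be propagated to all of them.
    results, current = [], board
    for _ in range(4):
        results.append(current)
        results.append(mirror(current))
        current = rotate(current)
    return results
```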

This chart corresponds to the case with variable gains/losses and accelerated learning. The bot seems more stable and efficient. When the gains are greater than the losses, the learning needs more games before stabilization. In the other cases, the learning is fast.

7th competitor: 'The Homo Sapiens'

I have just said that this competitor is peculiar: it is a biological computer with lots of sensors and actuators. So if I want to use it to play Tic-Tac-Toe, we need to create an adapted interface. A possible interface could be pen and paper, but that is suited to a Homo Sapiens vs. Homo Sapiens match, and we want to compare this competitor to the other bots. So, the interface is composed of a keyboard, a mouse and a screen. The screen displays a 3x3 grid. Each cell of the grid may have three different contents: empty, 'X' or 'O'. The Homo Sapiens uses the mouse to select a cell.

Interface between Homo Sapiens and the TicTacToe application.

The user needs to select two competitors (here 'Homo Sapiens' and 'Full Random') and to click on the "play" button. Then the competitors begin the battle.

Example of a match.

The application and some bots have parameters; to change them, the user clicks on the buttons with the cog-wheel icon.

TicTacToe interface with the parameters widget.

"The Homo Sapiens" allows to understand the behavior of "The Bandit": In fact, it always plays in the same way. So, it is very no fun to play again "The bandit". Even if "The Brute" cannot lose, its randomize behavior is more interesting.

Conclusion

"The Bandit" uses a learning approach to play in TicTacToe. Rules of the game are not defined to the bot and if rules changes, "The Bandit" may adapt its choices.

Historically, the Homo Sapiens is the first player to have played Tic-Tac-Toe. Nevertheless, it needs more time to select a cell, and its behavior may be strange: it may solve complex cases yet make stupid mistakes, choose to lose or refuse to play. So, comparisons of efficiency are difficult.