dpoly
New Member
Posts: 16
Post by dpoly on Jun 28, 2017 3:36:04 GMT -8
I've just spent more time than I expected implementing a Monte Carlo Tree Search from scratch, using available references. I used the venerable Tic-Tac-Toe as a test bed, and was amazed how badly it played. Is it a bug? Is it the algorithm? What is going on here?

Eventually I realised that (in effect) the player chooses a move on the assumption that the opponent will play randomly, and MCTS provides no means of avoiding moves that result in an immediate loss. In TTT terms, there can be an open two and it will not choose the blocking move, because other moves scored more highly on random play-outs.

I solved my problem by detecting losing moves and overriding the weights for those cases, but it doesn't feel like a very general solution. I assume others have found better ways to solve the problem, but I haven't been able to find any references. My guess is that the solution might be to combine a deep play-out with a shallow minimax, but I'm not in the mood for original research if someone else can just tell me the answer!
Post by alandau on Jun 28, 2017 20:45:05 GMT -8
UCT should be able to play tic-tac-toe well (perfectly, given ~1200 rollouts per move decision, based on my research).
"[T]he player chooses a move on the assumption that the opponent will play randomly" -- this should not be the case. Are you actually building out a tree of node evaluations over time, or are you just using the simpler Monte Carlo method of collecting random results from one node of the tree?
Post by dpoly on Jun 29, 2017 3:49:57 GMT -8
Thanks for the response. I really would like to get this one nailed. The method I used is as per this reference: www.cameronius.com/research/mcts/about/index.html (but there are similar ones elsewhere).

After (say) 20 iterations there should be 9 nodes at the first level and 11 at the second level, each of which has a sub-tree extending to a final position. After (say) 90 iterations there will be an average of 10 visits to each top node and a win/lose/draw result for each, aggregated into counts at the top level. After about 500 iterations most lines of play are sound, but interspersed are these awful moves.

Each playout is random, so it will contain awful moves: a missed immediate win or loss. It is perfectly possible for a branch to contain (say) 3 legal moves, of which one is an immediate forced loss and two eventually result in forced wins; this branch gets a favorable count although it is guaranteed to lose against best play. More counts don't make it better.

Is your research in the form of published code?
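[Editor's note: for concreteness, the iteration scheme discussed in this thread (selection, expansion, random play-out, backpropagation of win/draw counts) can be sketched end-to-end for TTT. This is a minimal illustrative sketch, not any poster's actual code; the exploration constant, node layout, and "most visited child" final choice are all assumptions.]

```python
import math, random

WIN_LINES = [(0,1,2),(3,4,5),(6,7,8),(0,3,6),(1,4,7),(2,5,8),(0,4,8),(2,4,6)]

def winner(board):
    for a, b, c in WIN_LINES:
        if board[a] is not None and board[a] == board[b] == board[c]:
            return board[a]
    return None

def legal(board):
    return [i for i, v in enumerate(board) if v is None]

def other(p):
    return 'O' if p == 'X' else 'X'

class Node:
    def __init__(self, board, player, parent=None, move=None):
        self.board, self.player = board, player      # player: side to move here
        self.parent, self.move = parent, move        # move: the move that led here
        self.children, self.wins, self.visits = [], 0.0, 0
        self.untried = legal(board)

    def best_child(self, c=1.4):
        # selection: UCB1 = mean value + exploration bonus
        return max(self.children, key=lambda n:
                   n.wins / n.visits + c * math.sqrt(math.log(self.visits) / n.visits))

def mcts(board, player, iters=2000):
    root = Node(list(board), player)
    for _ in range(iters):
        node = root
        # 1. selection: walk down fully-expanded nodes via UCB1
        while not node.untried and node.children:
            node = node.best_child()
        # 2. expansion: add one untried child (unless terminal)
        if node.untried and winner(node.board) is None:
            m = node.untried.pop(random.randrange(len(node.untried)))
            b = list(node.board)
            b[m] = node.player
            node.children.append(Node(b, other(node.player), node, m))
            node = node.children[-1]
        # 3. play-out: uniformly random play to the end of the game
        b, p = list(node.board), node.player
        while winner(b) is None and legal(b):
            m = random.choice(legal(b))
            b[m], p = p, other(p)
        w = winner(b)
        # 4. backpropagation: score each node for the player who moved into it
        while node is not None:
            node.visits += 1
            if w == other(node.player):
                node.wins += 1.0
            elif w is None:
                node.wins += 0.5
            node = node.parent
    # final decision: the most-visited root child ("robust child")
    return max(root.children, key=lambda n: n.visits).move
```

With a few thousand iterations this sketch does block an open two and take an immediate win, which bears out the point made later in the thread: the counts converge, but only after many iterations.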
rxe
Junior Member
Posts: 61
Post by rxe on Jun 29, 2017 9:38:44 GMT -8
I am a little confused by what you are saying. Going on a hunch: you say "each playout is random" and "player chooses a move on the assumption that the opponent will play randomly" - are you always running the playouts from the root of the tree? The playout should not be random from the root: the tree part selects the best move (mostly) greedily, with an exploration bias, and only when it reaches the expansion stage does a random playout start. I don't know if that is what is happening, but regardless, TTT (tic-tac-toe) needs very few iterations (1200, as alandau has found) to play perfectly with a basic MCTS implementation.

Another possibility is that your exploration bias is too high, in which case the selection can look random (since it basically explores the node with the fewest visits). Since you say it chooses randomly for the opponent, this could be a bug in calculating the score of the node plus the exploration term.

Some further comments:

1. I suggest you work through this yourself, but if you need to look at a simple MCTS implementation, there is one here in C++: github.com/richemslie/galvanise/blob/master/src/cpp/player/simplemcts/simplemcts.cpp (shameless plug)

2. There is something called greedy playouts, which look ahead one step for a win during the playouts. Some players have a more advanced mode: if a greedy playout move is possible, they backtrack to the previous state and look for a simple defensive move. This can greatly improve play in some games (such as Connect 4 and Breakthrough) without too much of a slowdown. However, TTT is a very simple game and this is complete overkill; MCTS will do fine on its own.
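[Editor's note: the "greedy playouts" in point 2 amount to a drop-in playout policy. A minimal sketch follows; `wins_after` is a hypothetical game-specific helper, not anything from the code linked above.]

```python
import random

def greedy_playout_move(board, player, legal_moves, wins_after):
    """One step of a 'greedy' playout: take an immediate win if any move
    produces one, otherwise fall back to a uniformly random move.
    `wins_after(board, move, player)` is an assumed helper that reports
    whether playing `move` ends the game as a win for `player`."""
    for m in legal_moves:
        if wins_after(board, m, player):
            return m
    return random.choice(legal_moves)
```

The cost is one extra terminal test per legal move per playout step, which is why it slows some games down more than others.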
Post by alandau on Jun 29, 2017 21:12:07 GMT -8
Re: my "research": see alloyggp.blogspot.com/2017/03/uct-charts.html, and the first graph in particular, for tic-tac-toe. The code is not currently public; I built it in my player's codebase since that had a UCT implementation and fast rule engines ready to use.

Looking at your more recent post, I might see the misunderstanding: "It is perfectly possible for a branch to contain (say) 3 legal moves of which one is an immediate forced loss and two eventually result in forced wins; this branch gets a favorable count although it is guaranteed to lose against best play. More counts don't make it better." Actually, more counts should make it better, because there is a sort of weighting going on (though I'm not accustomed to thinking of it as a weighting): the move with the opponent's forced win will be selected more often in the "selection" phase, and this will result in that win for the opponent being added more often to the results for that node in the backpropagation phase. As a result, the value of that node for your player will eventually decrease.
Post by dpoly on Jul 5, 2017 5:55:48 GMT -8
To rxe: by definition, play-out is always random. The sequence is selection (applying the UCT bias); expansion (of a random node); play-out (totally random play to some end point). I have no reason to believe that my code is generating the wrong counts -- by hand-checking, the counts all look fine. The question is: how does a set of counts get turned into winning moves (at best) and reliably avoid really bad losing moves?
How does a winning fraction of (say) 300/500 for a line of play equate to a forced win in one? Or perhaps a forced loss in one? What's the point of playing out hundreds of variations if one of the moves available is an immediate win? Or immediate loss?
Please remember: I'm really not trying to write a perfect TTT player. My aim is an MCTS that can play a range of games and hopefully generate reasonable (and non-stupid) moves even when there is not enough time for a deep analysis.
Thanks: I will check out greedy playouts and the code provided.
Post by dpoly on Jul 5, 2017 6:22:22 GMT -8
To alandau: the charts are interesting, but would be more useful if they came with win/lose/draw counts. My question is: at what iteration count does the player *never* lose? At what count does it *never* fail to take a forced win? When it fails, what does it do wrong?
I'm mostly working with iteration counts of about 300-500 for TTT, because I want to see the edge conditions that will arise in tougher games with insufficient time. What do you see happening at such lower counts? Dumb moves?
Re weighting: I'm not sure I understand the logic. If one of the available moves (during play) is an immediate loss (failure to block an open 2), there may still be several other moves open to the opponent in that line. The line of play as a whole may have a winning balance (most opponent moves will let us win), and there seems to be no way to avoid that line, and thereby the immediate loss.
Post by alandau on Jul 5, 2017 21:01:08 GMT -8
There are a few keys to UCT's behavior when there's a misleading node like you described:
1) Once the tree of evaluation nodes includes the node with the misleading set of opponent options, the set of playouts going through that node are controlled by selection (the UCB formula), not the uniform randomness of the playout phase. That formula will cause most of the rollouts from that point on to go through the opponent's winning move, so that node will be evaluated as more and more bad for your player over time. (The UCB formula has some nice mathematical guarantees for the "multi-armed bandit" problem that boil down to "in the long run, we'll pick the option(s) with the highest expected value much, much more than we'll pick the others".)
2) The tree of evaluation nodes grows unevenly across the game tree, due to the differences in the selection stage. If there's a particular series of moves that it thinks is more likely to occur, the evaluation nodes will extend deeper into that part of the tree.
3) For tic-tac-toe in particular, you'll be able to capture the entire game tree in the evaluation tree, at which point it's all selection and backpropagation. UCT will then gradually converge to the minimax values for all the game nodes, per the properties of UCB mentioned above.
4) For many games, in most states, the raw number you get from uniform randomness is a decent heuristic for the overall strategic value of positions, as opposed to the short-term tactical value, which gets handled by the selection phase as described above.
There are, of course, some games where UCT will just do an exceptionally poor job (Breakthrough Suicide being one example).
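[Editor's note: the UCB formula referred to in point 1 is usually UCB1. A minimal sketch follows; the default constant c = sqrt(2) is the textbook value, not necessarily what any player in this thread uses.]

```python
import math

def ucb1(child_value_sum, child_visits, parent_visits, c=math.sqrt(2)):
    """UCB1 score used in the selection phase: an exploitation term
    (the child's mean value) plus an exploration bonus that shrinks as
    the child accumulates visits relative to its parent."""
    return (child_value_sum / child_visits
            + c * math.sqrt(math.log(parent_visits) / child_visits))
```

The two terms capture the guarantee described above: under-visited children get a large bonus and are tried occasionally, but in the long run the mean value dominates and the best options are selected far more often.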
Post by dpoly on Jul 6, 2017 3:49:15 GMT -8
To rxe: after reading your code, I realise that you don't implement the algorithm the way my references do. For example, in jeffbradberry.com/posts/2015/09/intro-to-monte-carlo-tree-search/ you can see that the back-propagation uses the raw counts of wins vs visits, whereas you back-propagate a computed weight. Also, your code detects a node that is 'finalised' and treats it differently. This is the kind of adaptation that is needed to deal with early win/loss, but it's not described as part of the basic algorithm. I should point out that much of the literature deals with games that have high fan-out and no early termination, such as Go or Othello. I would expect games like Connect-4 or Go-Moku to be played badly unless imminent wins/losses are handled accurately. I wonder if you have any experience with those games and your player.
Post by dpoly on Jul 6, 2017 4:01:22 GMT -8
To alandau: yes, I understand that. I guess what I'm concerned about is games that are too big to analyse in depth but are subject to sudden death, like Go-Moku and Connect-4. If my experience is genuine, MCTS can only play them well by playing out thousands of useless lines, and a modification is needed to compress lines that lead rapidly to sudden death (for either side).
My test setup is to play MCTS against a random opponent, or one that systematically works through all available moves. The overall behaviour is much as expected: MCTS picks good lines of play and wins a very high percentage. Nevertheless there is the occasional loss, and when I analyse it I find the player picking suicidal moves that happen to come out high in the weighting because of other lines the opponent might have chosen (but usually will not).
It seems to me that for games with the potential for sudden loss, MCTS needs to be augmented by a shallow search on the candidate move(s) to avoid stupid mistakes.
Post by rxe on Jul 7, 2017 1:06:20 GMT -8
dpoly wrote: "after reading your code, I realise that you don't implement the algorithm the way it is in my references. For example, in jeffbradberry.com/posts/2015/09/intro-to-monte-carlo-tree-search/ you can see that the back propagation is the raw counts of wins vs visits, where what you do is back propagate a computed weight. Also, your code detects a node that is 'finalised' and treats it differently. This is the kind of adaptation that is needed to deal with early win/loss, but it's not described as part of the basic algorithm."

Hi - the code I pointed you at actually doesn't do that kind of adaptation to deal with early win/loss (sudden death games, as I like to call them). Another player in the same repo does do a little more: github.com/richemslie/galvanise/blob/9d9a58eda1c4f8c585239ed86596f73bb41506c1/src/cpp/player/mcts/mcts.cpp

During selection it makes use of 'finalised' nodes. If a child is a 'loss', it will try to avoid those completely. If it is a 'win', it will take that straight away. See lines 180-200. A win is 100, a loss is 0 (or 1.0/0.0, since I divide by 100 to look more like a probability).

Nodes are finalised as follows (for games where one player moves and the other noops - ie non-simultaneous games):
* in a terminal state, the score on the node is set to the goal value for each player;
* when calculating the score of the node during back prop:
* if there is a finalised child that is a win for the player with more than one move, mark that node as finalised and inherit the values of that child;
* if all children are finalised, set the score to the max child score (which might unfortunately be 0 - a loss).
See lines 420-470 for the pertinent code.

For sudden death games, a non-random rollout (as I mentioned in a previous post) makes a large difference.

Regarding not storing wins on nodes: for GGP you want to store the expected values for each player, as games can be non-zero-sum, simultaneous and/or include draws, etc. The computed weight you see in the nodes is just the total score seen (the goal value at the end of a random playout/rollout) divided by the number of visits.

May I ask how you are dealing with draws in your implementation if you are only using wins? TTT ultimately ends up in a draw.
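[Editor's note: the finalisation rules described above might be sketched roughly as follows. This is a simplification, not the galvanise code; it assumes each child's `score` is already expressed as a win probability from the perspective of the player to move at the parent node.]

```python
def update_finalised(node):
    """Backprop step for 'finalised' nodes: propagate solved results up
    the tree.  `node.children` is assumed non-empty; each child carries
    `finalised` (bool) and `score` (value in [0, 1] for the player to
    move at `node`, where 1.0 is a win and 0.0 is a loss)."""
    if any(c.finalised and c.score == 1.0 for c in node.children):
        # a solved winning reply exists: take it, this node is solved too
        node.finalised, node.score = True, 1.0
    elif all(c.finalised for c in node.children):
        # every reply is solved: the mover's best option is the max child
        # (which might unfortunately still be 0.0 - a forced loss)
        node.finalised, node.score = True, max(c.score for c in node.children)
```

Once a node is finalised, the selection phase can treat it exactly, instead of re-estimating it from playout counts, which is what makes this effective in sudden death games.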
Post by dpoly on Jul 10, 2017 6:14:44 GMT -8
Thank you, that is precisely the issue. I found an excellent example: the Traffic Light variant of TTT (see Zillions or boardgamegeek.com/boardgame/1893/traffic-lights). Plain MCTS analyses a position with an open two of red lights as advantageous, because every move except one allows an instant win, many of which show up in a random play-out. Of course that one exception is the instant loss! The first paragraph of chessprogramming.wikispaces.com/UCT refers to the same issue. My current player takes exactly that view! I'd be interested in how yours does (I use ZRF; I don't know if there is a GDL for that game).

Yes, the algorithm you describe is what I have now partially implemented, but I'm still wrestling with weight propagation. I don't fully understand your explanation, but thanks for the link; hopefully the code will make it clear. I use football scoring: 2 points for a win, 1 for a draw. The counts seem to work OK -- it's knowing when and how to convert them to weights that is the main problem.
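[Editor's note: one common way to turn "football scoring" counts into a weight in [0, 1] is to normalise the mean score; a small sketch follows. The normalisation choice here is an assumption, not necessarily what the poster's code does.]

```python
def node_value(wins, draws, visits):
    """Convert raw counts into a node value in [0, 1].  With football
    scoring (2 for a win, 1 for a draw, 0 for a loss) the mean score is
    (2*wins + draws) / visits, which normalises to the same value as
    counting each win as 1 and each draw as 1/2."""
    return (2 * wins + draws) / (2 * visits)
```

The resulting value plugs directly into the exploitation term of the selection formula, so draws pull a node toward 0.5 rather than being ignored.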
Post by dpoly on Jul 13, 2017 5:53:34 GMT -8
Just a further note... with some input from Cameron Browne and this paper www.cameronius.com/cv/mcts-survey-master.pdf I was able to correct a subtle flaw in my implementation. The consequence is that it does now avoid awful moves, but not quickly. In the Traffic Light variant it takes around 300 iterations to find most good lines of play for any second move, but at least 500 and perhaps 1000 iterations to achieve weights that will avoid simple bad moves (such as B2 Red, A1 Red). In doing so it will visit and revisit those leaf nodes many times (at least 10% of all iterations in my testing). By comparison, a simple minimax search will rule out all the really bad moves in a 2-ply search of just 81 iterations.

So MCTS does work eventually, but to deal with "sudden death" games without wasting enormous amounts of processing it needs more, perhaps the ability to prune the game tree on finding an early leaf node. I don't currently have a reference to any work in this area, so I would appreciate a pointer if anyone knows of it. [Trying to do it myself led to the initial flaw.]
wat
New Member
Posts: 32
Post by wat on Jul 13, 2017 16:22:08 GMT -8
In most games, MCTS needs tens of thousands of playouts per move to achieve half-decent play. And millions of playouts per move to achieve high-level play.
Doing a 1-ply or 2-ply search when simulating playouts might be viable; these are called *HEAVY PLAYOUTS* (0-ply random simulations are called light playouts). But the additional overhead will drastically decrease the number of playouts per move, and the result might be worse.
Heavy playouts usually need some kind of heuristics to be effective, which is very non-trivial to do in GGP. So, "wasting" enormous amounts of processing is still an effective strategy in GGP in many cases.
There are other approaches, like *MCTS SOLVER*, which can quickly find and avoid shallow forced-loss variants and significantly improve the level of play in most games.
Post by Stephen on Mar 8, 2024 8:29:35 GMT -8
Hey, I doubt this thread will come back to life, but after reading all this I have to ask: what "subtle flaw" was corrected that fixed the problem? The link referenced is dead, so there's no hint of what the issue was. Quite disappointing to reach the end with no answer, as the OP seems to describe exactly what I'm facing.