Post by Steve Draper on Dec 18, 2013 15:47:15 GMT -8
I recently found I had some unexpected losses in certain games on Tiltyard, notably Reversi. After some investigation I found a considerable gain in playing strength (in Reversi specifically) from tuning the exploration bias in the UCT selection formula down somewhat. As I had spent some time tuning this parameter prior to the Coursera competition this was not entirely expected, so I have since done a little more investigation to try to get a feel for the optimal setting across a range of games.
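For reference, the selection rule in question is the standard UCT formula with a tunable exploration constant; a minimal sketch in Python (names are illustrative, not my actual engine code):

```python
import math

def uct_value(child_mean, child_visits, parent_visits, exploration_bias):
    """Standard UCT selection value: exploitation plus a tunable
    exploration term. exploration_bias is the constant under discussion;
    the textbook value is sqrt(2), but in practice it appears to be
    quite game-dependent."""
    if child_visits == 0:
        return float('inf')  # unvisited children are tried first
    exploitation = child_mean
    exploration = exploration_bias * math.sqrt(
        math.log(parent_visits) / child_visits)
    return exploitation + exploration
```

Turning the bias down shrinks the exploration term relative to the observed mean, concentrating visits on the currently best-looking lines.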
What I found was that there is no clear metric I can derive from meta-gaming simulation that correlates well with the empirically optimal value for this parameter (there are some that correlate well for MANY games, but with some outlying counter-examples). As such I'm currently testing a formula that works 'ok' for most games, and resigning myself to having to tune the parameter via learning (once I have the game isomorphism detection code to allow me to persist data and apply it to the correct games).
What correlation there is seems to boil down to deep game trees responding better to a smaller exploration bias. My theory is that in deep trees the 'signals' from the eventual terminal nodes necessarily come from longer, and therefore lower-confidence, probes. Such lower-confidence estimates of node value benefit more from a higher density of visits, which amounts to a bit less exploration across the tree (and hence a lower exploration bias).
However, I'm wondering about better ways to tune this value dynamically, possibly even within the exploration of different branches of the game tree, and almost certainly across different roles in asymmetric games (e.g. sheep & wolf, where I suspect applying a different bias at sheep decision nodes than at wolf ones may be productive).
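The per-role idea could be as simple as keying the bias on whichever role chooses at the node; a hypothetical sketch (the role names and the specific bias values are purely illustrative):

```python
# Hypothetical per-role exploration biases for an asymmetric game.
# The numbers here are placeholders, not tuned values.
ROLE_BIAS = {
    'sheep': 0.3,
    'wolf': 0.7,
}

def bias_for_node(role_to_move, default_bias=0.5):
    """Look up the exploration bias for the role choosing at this node,
    falling back to a single global constant for any other role."""
    return ROLE_BIAS.get(role_to_move, default_bias)
```

The selection step would then pass `bias_for_node(node.role_to_move)` into the UCT formula instead of one global constant.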
If my working theory about more diffuse signals from longer probe paths is correct, it suggests back-propagating a 'distance to terminal state' measure up the MCTS tree, and perhaps varying the bias depending on it.
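One way to sketch this: during backpropagation, each node keeps a running mean of how far below it the playouts terminated, and the bias is damped as that distance grows (the `scale` constant and the damping form are assumptions for illustration, not something I've validated):

```python
class Node:
    def __init__(self):
        self.visits = 0
        self.total_score = 0.0
        # Running mean of observed distances to terminal states below this node.
        self.avg_depth_to_terminal = 0.0

def backpropagate(path, score):
    """Walk from the end of the playout path back to the root, updating
    score statistics and the running mean distance-to-terminal at each node."""
    for distance, node in enumerate(reversed(path)):
        node.visits += 1
        node.total_score += score
        # Incremental update of the mean observed distance.
        node.avg_depth_to_terminal += (
            (distance - node.avg_depth_to_terminal) / node.visits)

def adjusted_bias(node, base_bias, scale=0.05):
    """Shrink the exploration bias where signals travel further:
    the deeper the average terminal state below a node, the lower the bias.
    scale is a purely illustrative tuning constant."""
    return base_bias / (1.0 + scale * node.avg_depth_to_terminal)
```

Selection would then use `adjusted_bias(node, base_bias)` in place of the fixed constant, so subtrees with long, diffuse probes get a denser, more exploitative visit pattern.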
Anyone have any insights to offer in this area?